ABSTRACT


The Department of Defense struggles to develop and maintain cutting-edge software through 
the Defense Acquisition System. The pace of improvements in machine learning algorithms 
and software suggests the organization will fail to rapidly develop systems incorporating 
the latest innovations to meet its intelligence-related media analysis needs. In contrast, 
the trend of industry and academia releasing algorithms and software under permissive 
licenses presents defense organizations with an opportunity. These groups can potentially
overcome resource shortfalls and long acquisition timelines by implementing machine 
learning solutions with open-source software. We test this hypothesis by employing an 
open-source software library to evaluate publicly available deep learning algorithms on 
three prior defense-related datasets. We then compare performance of deep convolutional 
neural networks to past methods for detecting AK-47s, ships, and screenshots in images. 
Applying proven algorithms through the software framework, we test three object detectors 
that exceed or match classification performance for all three experiments in a third of the 
development time available to designers of the previous algorithms. We relate these tests 
to defense scenarios in order to provide a logical argument and empirical measure of the 
utility of open-source machine learning frameworks to meet the Department of Defense’s 
intelligence-related media analysis needs. 
CHAPTER 1:

Introduction 


Today, thorough media analysis for the purpose of gathering intelligence proves
time-consuming and expensive. Human fascination with capturing and sharing experiences,
the arm’s reach availability of modern cameras, and the increasing amount of data posted
to social media daily feed a boundless flow of new images and videos. “Clara,” a notional
intelligence analyst, must wade through this flow of information searching for the critical 
imagery that will give her team the necessary tip to prevent the next homegrown terror 
attack. She sorts through her photographs and video stills diligently, fast enough to get 
through her minimum for the day, yet methodically enough to be sure to tag all relevant 
objects in an image for further analysis. Just one of many, Clara’s efforts aggregate with 
thousands of others who stand watch over their nation hoping to find actionable intelligence 
in the day’s social media postings. Without accurate and efficient computer algorithms to 
help, organizations within the U.S. Department of Defense (DOD) are left training human 
beings to complete repetitive and headache-inspiring image-tagging tasks. Fortunately,
recent improvements in modern computer algorithms for complex image classification tasks
elevate computer performance to a level comparable with humans; Russakovsky et al. [1]
find that their best expert annotator outperforms modern algorithms by only 1.7 percentage points of error. For
the DOD, this suggests computers can assist with, or take over, many of the organization’s 
intelligence-related media analysis tasks today. 

1.1 Current Image Classification Environment 

To support the claim that computers stand ready to assist in these tasks, we explore recent 
developments in the image classification research environment in terms of the available 
algorithms, hardware, and software. For this thesis, image classification will refer to 
the automated process of labeling an image with single or multiple labels to describe 
what objects appear in that image. In the past decade, computer algorithms have gone 
from achieving noteworthy performance on simple image classification tasks to competing 
with human performance on complex tasks, such as labeling objects in an image into 
1,000 possible categories for the ImageNet Large Scale Visual Recognition Challenge 
(ILSVRC) [1]. Specifically, recent breakthroughs in the field of deep learning neural
networks reduced the error rates of state-of-the-art systems by 30% to 50% [2] when 
completing tasks like those included in the ILSVRC. The most successful models harness 
deep convolutional neural networks (CNNs) for their architectures. They also take advantage 
of graphics processing units (GPUs) to speed up computations for training the networks by 
one or two orders of magnitude [2]. While the code and intricate algorithm architectures of
such cutting-edge tools are beyond what an inexperienced programmer typically can achieve,
the recent trend of industry and academia making machine learning software available via
open-source licenses brings access to this technology to even novice programmers. Further,
well-known companies like Google and Microsoft have already open sourced their machine
learning tools, TensorFlow and Microsoft Cognitive Toolkit (CNTK), respectively. Table 1.1
is a list of some of the major frameworks currently offered as open-source software (OSS). 
The progress in research and the availability of the source code for powerful frameworks 
should inspire reflection for large government organizations like the DOD. Is the DOD 
successfully developing similar technology to provide capabilities like autonomous image 
analysis and should the organization pursue available OSS solutions? 


Table 1.1. Sample of open-source deep learning software frameworks

Organization                          Framework          First Released
Online Community                      Torch [3]          Oct 2002
Academic Community                    scikit-learn [4]   Feb 2010
University of Montreal LISA Lab       Theano [5]         Mar 2010
Berkeley Vision and Learning Center   Caffe [6]          Oct 2013
Google                                TensorFlow [7]     Nov 2015
Academic Community                    MXNet [8]          Dec 2015
Microsoft                             CNTK [9]           Jan 2016


1.2 Using an Open-Source Deep Learning Software 
Library in the DOD 

With the volume and variety of available machine learning algorithms that are continually 
improving, compounded with the challenges large government groups face in maintaining 
their software, organizations within the DOD seem unlikely to rapidly develop and employ 
cutting-edge machine learning systems through the Defense Acquisition System. While 
the U.S. Government is certainly capable of undertaking complex software projects, they
are often costly and software development continues to prove problematic [10]. Potentially 
limiting factors come to mind when considering government software acquisition. To name 
a few: 


• Budget and contract constraints 

• Complexity of the Defense Acquisition System 

• Inability to modify source code behind a vendor-produced system [11]

• Long contractual and acquisition timelines 

• Limited machine learning talent pools 

These limitations inspire the sentiment that software contracted for independent
development by a DOD organization is more likely to be second rate, quickly obsolete, and
too inflexible to take advantage of new breakthroughs in the field of deep learning.

On a positive note, current DOD policies towards OSS acknowledge the aforementioned
limitations and encourage the consideration and application of open-source solutions
whenever they are available. Specifically, in 2009 the DOD Chief Information Officer (CIO)
published a memorandum providing guidance to encourage market research of OSS solutions
to meet mission needs [11]. This and other policies facilitate the DOD employing
state-of-the-art deep learning software through OSS frameworks. Given the current machine
learning environment and a supportive stance by the DOD on OSS, we propose an
evaluation of Google’s TensorFlow as a viable solution to meet the DOD’s machine learning
needs for automated image classification tasks.

1.3 Why TensorFlow? 

After a simple comparison of capabilities, we chose Google’s TensorFlow to conduct an 
evaluation of a single framework through several experiments instead of comparing available 
frameworks across a single task. With numerous deep learning software frameworks 
to choose from and a desire to avoid benchmark testing, we selected TensorFlow for our
experiments and leave the evaluation of other frameworks for future work. 

The decision to evaluate TensorFlow requires a brief discussion of the software’s background 
and the factors supporting its selection. In November of 2015, Google released TensorFlow, 
“an interface for expressing machine learning algorithms, and an implementation for
executing such algorithms” [7] under the Apache 2.0 open-source license. Google developed its
GoogLeNet neural network utilizing TensorFlow, winning the ILSVRC 2014 [7], the current 
image classification benchmark competition. The company also continued to make more 
accurate iterations of this trained network, codenamed Inception [12], [13], easily available 
for testing through the software framework. Beyond offering effective models, TensorFlow
carries with it the financial support and technical talent of Google to keep the software
improving. Confidence in Google’s ability to push the limits of artificial intelligence through
machine learning quickly inspired an online community, through GitHub [14], to partner 
with the company to improve TensorFlow. Figure 1.1 depicts the number of updates made 
to the software by Google employees and this online community over TensorFlow’s first 
year as an open-source project. In summary, TensorFlow’s implementation of cutting-edge 
machine learning algorithms, documented state-of-the-art performance in image
classification tasks, and its swift adoption by the open-source community indicate that TensorFlow
will stay relevant and state-of-the-art for the foreseeable future. 



[Figure: GitHub repository summary for TensorFlow showing 11,001 commits, 18 branches, 15 releases, 510 contributors, and the Apache-2.0 license]


Figure 1.1. Number of commits made to the TensorFlow software library
during its first year as an open-source library. A commit includes any
instance of additions or subtractions to the code base accepted by the project’s
administrators. For TensorFlow, this includes addition of new models,
interfaces, and tools for the software. TensorFlow had 15 versions released in its
first year, marking several significant and rapid improvements [14]. Adapted
from [14].
1.4 A Vision for TensorFlow-Based Intelligence-Related
Media Analysis across the DOD 

This section develops our vision for the applications of TensorFlow (or another OSS
framework from Table 1.1) with respect to the challenges of modern intelligence-related media
analysis in the DOD. Starting at the tactical level, a team of intelligence analysts at a
battalion or squadron could benefit from a software solution like TensorFlow. Current methods
require personnel to manually sort through thousands of images and hours of video to pull
out applicable information from confiscated hard drives or collected intelligence,
surveillance, and reconnaissance (ISR) [15]. What if analysts instead simply searched through
images or stills from video, first tagged by a TensorFlow-based algorithm, that were specific
to the threat they were looking for, like weapons that could indicate the militarization of an
owner of a hard drive? Beyond a single unit, all four service branches would have
something to gain from an intelligent unmanned autonomous system (UAS) that implements
TensorFlow solutions. Most modern UASs simply broadcast ISR imagery back to a central 
location, consuming large amounts of resources in a constrained bandwidth environment 
in the process, and then require a human-in-the-loop for analysis. What if a surface-based 
UAS, far in front of a carrier strike group, passively conducted video surveillance and only 
broadcast imagery back across a satellite link when it determined that collected imagery 
contained a threat? Finally, recall the hypothetical analyst Clara, who works at a national 
level intelligence organization: a single analyst cannot possibly examine enough images 
from a day’s typical posting to allow her agency to manage the flood of media created daily. 
Thus, the risk of missing warnings of terrorism, and threats against the citizens the agency 
supports, remains high. What if her department instead fed this flow of imagery through
compute resources, efficiently implementing cutting-edge machine learning algorithms in
TensorFlow, and pulled out images of interest as determined by current threat profiles?
Analysts like Clara could then carefully examine these manageable streams of vetted images
to pull out actionable information. This vision only touches upon a handful of use cases
for TensorFlow in the DOD in a single problem domain. Even so, the potential benefit
of successfully implementing the software in the organization sufficiently motivates our
research to determine whether such use cases are feasible.
1.5 Problem Statement

As illustrated above, training humans to examine the intimidating number of images
available for DOD intelligence analysis, quickly enough to ensure the information remains
relevant, requires resources that many groups in the organization simply do not possess.
These groups, from strategic-level national intelligence agencies to tactical-level military
units, continually face time and manpower constraints in tackling media analysis challenges.
In contrast to the cost of manpower, the cost of compute resources continues to decline
through the selection of cloud-based services offered commercially. Their affordability, coupled with
the trend of commercial and academic researchers open-sourcing cutting-edge algorithms 
and software libraries, presents groups throughout the DOD with an opportunity: combine 
cheap compute resources with powerful state-of-the-art software to overcome time and 
personnel shortfalls. Before the DOD can exploit this opportunity, preliminary research is 
necessary to explore the merits of open-source machine learning software and algorithms 
when applied to specific DOD media analysis challenges. Through this thesis we set out to 
provide a starting point for future research and a baseline of support for the application of 
these open-source solutions in the Department of Defense. 

1.6 Research Questions 

To further describe the goals of our research and offer specifics of what we set out to 
accomplish, we present six research questions below. We will answer these questions 
through comparing the performance of algorithms implemented in TensorFlow to three 
previous Naval Postgraduate School (NPS) theses for identifying AK-47s [15], ships [16], 
and screenshots [17] in image databases. 

1. Compared to previous methods, do TensorFlow-based deep CNNs: improve
classification error rate; reduce training time and the size of datasets required for training;
provide versatility and ease of use in different problem domains; and scale to enable 
deployment of trained networks for classification of images in resource-constrained 
environments? 

2. Do TensorFlow’s pretrained Inception models offer higher out-of-the-box recall and 
lower false positive rate (FPR) than previous methods to detect AK-47s and can they 
perform classification rapidly in a resource-constrained environment? 
3. How does TensorFlow’s pretrained Inception-v3 model perform in detecting ships in
images when retrained with transfer learning methods compared to the error rate of 
previous methods? 

4. How does a deep CNN, built and trained through TensorFlow with images acquired
from an internet database, perform in identifying screenshots in an image in
comparison to previous methods in terms of recall and FPR?

5. What are the potential benefits and risks of the DOD employing an open-source, 
continually updated, and public platform for its machine learning needs? 

6. How are the results of this analysis relevant to the DOD and to related future research
utilizing TensorFlow or other open-source frameworks?
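
As a rough illustration of the recall and FPR metrics named in questions 2 and 4, the sketch below computes both from confusion-matrix counts. The function names and the counts in the usage lines are hypothetical, chosen only to show the arithmetic; they are not results from this thesis or from the earlier NPS work.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that the detector flags: TP / (TP + FN)."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("recall is undefined without positive examples")
    return true_pos / total


def false_positive_rate(false_pos: int, true_neg: int) -> float:
    """Fraction of actual negatives wrongly flagged: FP / (FP + TN)."""
    total = false_pos + true_neg
    if total == 0:
        raise ValueError("FPR is undefined without negative examples")
    return false_pos / total


# Hypothetical counts, used only to demonstrate the calculation.
example_recall = recall(true_pos=90, false_neg=10)            # 90 / 100
example_fpr = false_positive_rate(false_pos=5, true_neg=95)   # 5 / 100
```

A detector that flags every image would reach a recall of 1.0 while driving FPR to 1.0 as well, which is why the questions pair the two measures rather than report either alone.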

1.7 The Way Ahead 

Having introduced the problem and the motivation of our research, we now establish the
waypoints for answering our research questions. In Chapter 2, we will discuss
acquiring open-source software through the Defense Acquisition System, three previous
NPS theses that will serve as our baseline for performance comparison, and modern
algorithms and techniques we will implement in our research. Chapter 3 establishes the
methodology for our three experiments and Chapter 4 provides their implementation details 
and the processes followed to conduct the experiments. With our tests complete, we present 
and discuss our results in Chapter 5. Finally, in Chapter 6 we draw conclusions from our 
results and specifically answer our research questions. The end state of this thesis is to 
present TensorFlow for employment by the DOD through introducing supporting policies, 
discussing the software’s key terminology and capabilities, and exploring three potential 
use cases by comparing performance to previous methods. In accomplishing these three 
objectives, we hope to provide a logical and empirical argument for further application of 
open-source machine learning software in intelligence-related media analysis. 
CHAPTER 2:
Background 


This chapter establishes the necessary background for our research by surveying four areas. 
First, Section 2.1 describes the Defense Acquisition System (DAS), along with software 
development and commercial software acquisition in this system, to garner support for 
applying OSS in the DOD. Next, Section 2.2 presents previously conducted experiments at 
NPS relating to computer-based object detection in images. These experiments serve as a
starting point for performance comparison and potential applications of TensorFlow within 
the DOD. Section 2.3 provides a brief overview of CNNs and three modern algorithms that 
implement them in order to achieve state-of-the-art performance on complex classification 
problems. A final section explores some of the methods available to train these algorithms 
on new or organization-specific image classification tasks. Through these sections, we
demonstrate that an environment exists that encourages the acquisition of open-source 
solutions, like TensorFlow, to meet the DOD’s needs for object detection in images. 

2.1 Software Development and Commercial Software 
Acquisition in the DAS 

The Department of Defense has a system in place to develop and deploy new products, 
including software, to its members. The system is known as the Defense Acquisition System 
or DAS. This section introduces the DAS as it would relate to the DOD’s procurement and 
the contractor development of a proprietary equivalent to TensorFlow. We aim to show that 
developing such software through the system proves less efficient than simply incorporating 
an open-source offering. Understanding the challenges associated with developing cutting- 
edge software through the DAS requires an overview of the system and its past performance 
in completing complex projects. These challenges are well known and inspired the DOD 
to develop a preference for finding commercially available solutions to meet capability 
requirements. Moreover, recent defense-policy shifts that rebrand OSS as commercial 
software encourage pursuit of OSS products to meet the Department of Defense’s needs. 
Finally, an analysis of the benefits and risks of incorporating OSS in defense systems gives 
confidence in the merit of considering such software before other options. 
2.1.1 Overview of the DAS

To start, we present a definition of the DAS from its defining directive: “the management 
process by which the Department of Defense provides effective, affordable, and timely
systems to the users” [18, p. 4]. The DAS is one of three components that make up 
the overarching system known as “Big A” acquisition [19, Ch. 1-3.2], which guides the 
incorporation of new products into the DOD. Since the other two components focus 
on top-level policy and budgeting, our remaining discussion will focus on the DAS, the 
process by which a software system progresses from need identification to delivery to DOD 
users [20]. Figure 2.1 provides the highest level overview of the DAS. It consists of 
a series of required documents that drive three milestone decisions for moving a system 
towards production and deployment. The current DAS is a product of numerous rules and 
regulations, added between 1947 and the present by congressional reform [21]. While the 
system provides structure and order to a challenging and fluid development process for 
complex weapon systems, the volume and variety of requirements can prove overwhelming. 
An overarching purpose behind the DAS, and its many reforms, is to minimize the loss 
of taxpayer dollars in developing DOD systems. The process achieves this goal through 
increased accountability and risk mitigation. These characteristics, coupled with
overregulation, stifle rapid innovation. Although well intentioned, and even necessary, the
system of complex regulations that make up the DAS becomes an added source of friction 
in an already complex product-development process. 

2.1.2 Frustrations in Software Development through the DAS 

The attributes described in Section 2.1.1 contribute to a negative sentiment towards pursuing 
software intensive projects through the DAS. Speaking of the numerous studies conducted 
to identify problems and suggest reform to acquisition processes, Linda Levine and Bill 
Novak of the Software Engineering Institute at Carnegie Mellon University summarize 
these frustrations well: “It is disturbing, however, that many problems associated with the 
development and acquisition of software-intensive systems remain unresolved -- and growing
in magnitude -- while proposed solutions remain either untried or not sustained” [10, p. 1].
Others suggest that the system, conceptualized and established during World War II, is 
simply too outdated for the demands of 21st Century development projects, regardless 
of reforms [22]. Such sentiments have led to policy shifts that encourage acquisition of 
commercially produced systems, including commercial software, before development of 
new systems through the DAS. For example, the latest version of Department of Defense
Directive (DODD) 5000.01, which “provides management principles and mandatory
policies and procedures for managing all acquisition programs” [18], explicitly states that
DOD Components shall consider commercially available products prior to other
acquisition options [18]. With supportive policy in place to acquire commercial software, the
determination of whether or not such policies apply to OSS remains.

Figure 2.1. Overview of the Defense Acquisition System process. Source: [20].

2.1.3 Using OSS in the DOD 

This section shows that the DOD considers OSS a commercial product for purposes of 
acquisition and also discusses the benefits and risks of implementing OSS in defense 
systems. For the average DOD user or program manager, deciphering the differences 
among the types of software available (i.e., commercial, freeware, shareware, open-source, 
closed-source) and what variants they can use with government systems proves perplexing. 
Fortunately, policy exists that clears up this confusion for the defense professional hoping 
to incorporate an OSS solution into an acquisition project. In 2009 the CIO of the DOD 
characterized OSS as equivalent to commercial computer software and directed that defense 
programs include it in the mandated market research of commercially available solutions
discussed in Section 2.1.2 [11, p. 4, par 2.b]. This policy grants permission for the
acquisition professional to implement OSS in defense programs using the same regulations
that govern closed-source software.

With permission established, we will now weigh some benefits and risks of OSS.
Specifically, we examine the software support available and cost, the speed of updates and
improvements, and the security and long-term use of such software. Open-source is not
synonymous with free. Often successful OSS projects, like Red Hat Linux, offer support
contracts that enterprise-level users can purchase, even though the source code for the
software is available publicly. A significant risk for the DOD implementing OSS is that
many projects do not provide this purchasable support package. While a responsive online
community, thorough documentation, and hirable third-party support can remedy this risk,
the level of support may fall short of what companies like Microsoft offer to the DOD
for closed-source software. In cases that deem a maximum level of support essential, it is
possible to hire a third party who is familiar with the OSS code base and applications. The
risk associated with reduced support is acceptable in many cases due to the prospect for cost
savings. As current DOD policy states: “Since OSS typically does not have a per-seat
licensing cost, it can provide a cost advantage in situations where many copies of the software
may be required, and can mitigate risk of cost growth due to licensing in situations where
the total number of users may not be known in advance” [11, p. 4]. Also, once the DOD
establishes guidance for employing an OSS project, other components of the organization
can quickly adopt the software, since the acquisition of per-instance licenses is no longer
a concern. Despite support concerns, the cost savings and speed at which organizations
across the DOD can share software solutions make OSS an attractive alternative to other
forms of commercial software.

The DOD desires innovation but struggles to acquire innovative systems. OSS can help with
faster update cycles and the ability to tailor source code for specific needs and mission types.
As illustrated by the fifteen releases of TensorFlow in its first year (see Figure 1.1), popular
OSS projects attain release cycles measurable in terms of weeks or months compared to
the annual or bi-annual releases of other types of commercial software. The short update
cycle of OSS, coupled with the fact that new versions do not require purchase of updated
licenses, affords the defense sector an opportunity for rapid and cost-efficient prototyping
of new software solutions [11]. Further, the varying missions and threats faced by DOD
Components demand software that users or support personnel can modify swiftly and
apply to new problems or adapting enemies. The DOD CIO summarized this benefit
well in his 2009 memorandum: “The unrestricted ability to modify software source code
enables the Department to respond more rapidly to changing situations, missions, and future
threats” [11, p. 4]. But what are the risks? With a community of users producing rapid
updates, the potential for the introduction of code-breaking bugs exists. Fortunately, modern
software testing techniques and issue reporting services offered through websites hosting 
open-source projects, like GitHub [14], largely alleviate this risk. These techniques, the 
number of reviewers and testers of popular projects, and the ability to delay implementing 
the new releases of software for critical defense systems, allow an organization to manage 
the unique uncertainties of relying on OSS. With benefits of faster updates and the ability 
to directly modify source code outweighing other concerns, OSS provides the DOD with 
an opportunity to maintain state-of-the-art software solutions across its many mission sets. 

We must also weigh benefits and risks in terms of security and the prospect for
long-term use of OSS products. One unique concern of OSS is that an adversary, hoping
to find a vulnerability to attack a defense system, has access to the source code that 
drives the system. While the dangers of such access seem obvious, perhaps it is most 
surprising that some experts argue that OSS leads to improved security over closed-source 
software. This argument’s merit stems from the fact that benevolent users, such as DOD 
security professionals, also have access to the source code, instead of only the private 
firms developing the software. This access promotes efficient discovery of security flaws, 
which an open-source community or a DOD organization can quickly patch and release. In 
contrast, the license agreement of closed-source software would make such modification 
and distribution illegal [23]. Moving on to long-term use, some opponents may argue that a 
group maintaining an open-source project can stop updating the source code at any time, thus
introducing risk in relying on such software. To counter this argument, first consider the 
repercussions of a vendor owning legal rights to the source code of a defense system going 
out of business. Next, think of a community dropping support for an open-source project, 
of which the DOD maintains legal access to any copy of the source code it possesses [23]. 
In the former scenario, the DOD holds few options beyond running the current version of 
the software until paying a new vendor to develop a comparable product from scratch. In 
the latter circumstance, DOD users can assess their need for further updates of the OSS
before deciding whether or not to internally maintain a version of the software or contract
a vendor to pick up where the open-source community left off. This right to update and
modify the source code frees the DOD from vendor lock-in and provides greater opportunity
for long-term use and improvement of OSS products. 

With updated policy in the DAS and from DOD leadership in support of the pursuit of 
OSS, defense professionals have the necessary permission to implement OSS solutions to 
meet their operational needs. The frustrations of developing complex software through
the DAS provide the motivation to seek OSS solutions prior to software development
with the system. Finally, an analysis of top-level benefits and risks of engaging OSS in
defense suggests its suitability for many defense systems. Overall, this assessment inspires 
confidence in the DOD’s ability to acquire, implement, maintain, and benefit from an OSS 
library like TensorFlow. 

2.2 Previous NPS Image Classification Experiments 

In addition to weighing the benefits of OSS and presenting the ability of the DOD to employ a 
solution like TensorFlow, we aspire to display empirical evidence that the software performs 
in realistic DOD use cases. Following the AK-47, ship, and screenshot detector scenarios 
developed in Section 1.4, we introduce three previous experiments at NPS which evaluate 
machine learning algorithms on datasets related to our scenarios. We eventually incorporate 
these datasets into our experiments and employ the results from the prior tests to serve as 
a baseline for comparing TensorFlow’s performance on the same image sets. These three 
experiments all rely on human-engineered feature extractors specific to each object detection 
problem. While their three algorithms contain some similarities, the methods are mostly 
unique to each experiment. The discussion of this varying methodology in this section 
stands in contrast to the single algorithm type, a deep CNN, and a single software library 
that we apply across all three datasets for our experiments. Finally, we also include any 
training or evaluation time metrics reported in the previous theses to compare to our tests. 

2.2.1 Detection of AK-47s by Justin Jones 

The first experiment we reference simulates the potential use case for an intelligence analyst 
at a battalion or squadron level discussed in Section 1.4. Justin Jones [15] published the 
results of this experiment at NPS in September of 2010. Starting with 18 videos containing 
AK-47s, Jones [15] developed a training set and test set of images by capturing stills from 
these videos. His training set consisted of 1,146 stills containing AK-47s from 13 of the 
videos and 5,660 negative training images from the internet. His test set contains 687 
stills of AK-47s from five videos and 7,045 images containing people and other objects 
without AK-47s present. To conduct his experiment, Jones [15] implemented both whole 
AK-47 and left and right parts-based classifiers with three algorithms. First, he trained 
several iterations of a Viola-Jones Classifier to pull out features from small portions of 
the images with the Haar training utility from OpenCV [24], an open-source computer 
vision software library. He then fed the extracted features from an image segment into a 
support vector machine (SVM) and a simple artificial neural network, known as a multilayer 
perceptron (MLP), to determine if an AK-47 existed in the segment. Repeating this process 
for over 300,000 unique windows across an image, he produced an AK-47 count for each 
video still. Table 2.1 contains training resources available and time required for each 
algorithm Jones [15] tested. 


Table 2.1. Training time statistics from Jones’ [15] thesis. 

Algorithm                 Processor               RAM   Training Time
Viola-Jones Classifiers   Intel Core 2, 2.4 GHz   2GB   48 hours
SVM                       Intel Core 2, 2.0 GHz   4GB   2 seconds
MLP                       Intel Core 2, 2.0 GHz   4GB   2 seconds

Jones [15] provided the times in this table for training the components of his 
algorithms on his 6,806 total positive and negative training images. 


Jones [15] measured the performance of his algorithms through a comparison of recall and 
FPR on the test image set described above. He presented this information in the form of a 
receiver operating characteristic (ROC) curve and Figure 2.2 displays his overall results for 
his algorithms. This curve will serve as the basis for our performance comparison. 
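
To make the two axes of a ROC curve concrete, the following minimal sketch computes one (FPR, recall) point per decision threshold. It is an illustration only and is not Jones' code; the function name `roc_points` and the scores and labels are hypothetical values we made up for the example.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, recall) pair per decision threshold.

    scores: classifier confidence that each sample is positive.
    labels: 1 for a sample that truly contains the object, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        recall = tp / positives  # fraction of real objects that were found
        fpr = fp / negatives     # fraction of negatives wrongly flagged
        points.append((fpr, recall))
    return points


# Hypothetical scores for four positive and four negative samples.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
points = roc_points(scores, labels, thresholds=[0.75, 0.5, 0.25])
```

Lowering the threshold moves a detector up and to the right along the curve: it finds more true objects at the cost of more false alarms.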

Figure 2.2. Jones’ [15] AK-47 detector performance. Notably, Jones’ [15] 
scale for his FPR is at 10“"^. His algorithm produced a classification at over 
300,000 window positions across each frame to predict how many AK-47s 
existed in an image. The importance of limiting false positives with such 
a large number of classifications per image led Jones [15] to display only 
results with an extremely low FPR. Source: [15, Figure 5.8]. 


2.2.2 Detection of Ships by David Camp 

The second experiment we reference simulates the use case for intelligent UASs discussed 
in Section 1.4. David Camp published the results of this experiment at NPS in December 
of 2013 [16]. Camp’s archived training image set from this experiment consists of 110 
instances of ships and 97 instances of sky, sea, and coastline with and without buildings that 
do not contain ships. These numbers vary slightly from his reported 105 images for both 
categories. His test set possesses 405 images containing ships and 100 images not containing 
ships, similar to the training set. Camp scaled all test images at eight levels, from full scale to 
five percent scale, to simulate classification at different distances from a sensor. He applied 
several feature extractors and methods for classification to determine which provided the 
strongest performance, but for the sake of brevity we only discuss his top two classifiers. 
The first is what Camp labeled a HYBRID detector, combining histogram of oriented 
gradients (HOG) and bag-of-words (BOW) algorithms into one classifier. The second was 
a deformable parts model (DPM). Like Jones, Camp also employed the OpenCV software 
library for all of his experiments. To measure performance, Camp calculated recall and 
FPR to produce a ROC curve. Table 2.2 presents information about runtime performance 
of the algorithms while Figure 2.3 presents his ROC curves for comparison of classification 
performance. 

Table 2.2. Evaluation runtime statistics from Camp’s [16] thesis for a single 
image. 

Algorithm   GPU (RAM)                  Evaluation Time
HYBRID      NVIDIA Quadro 2000 (1GB)   0.5 seconds
DPM         NVIDIA Quadro 2000 (1GB)   2 seconds


Figure 2.3. Camp’s [16] best ROC curves at varying scales for performance 
comparison. We include both ROC curves because the HYBRID algorithm performs 
best across all scales, but the DPM algorithm achieves the best performance 
at larger scales. Although Camp labeled the x-axis 1 - Precision, his thesis 
describes this value as equivalent to the FPR. Both algorithms show a 
significant decrease in performance at smaller scales. 
Source: [16, Figure 55, Figure 18]. 

2.2.3 Detection of Screenshots by Lauren Sharpe 

The final experiment for comparison autonomously detects whether or not an image is a 
screenshot. This experiment simulates the use case for a national level agency that must sort 
through volumes of images to identify those that meet a specific threat profile or criteria. In 
this case, the images sought may be from a terrorist group hiding instructions for its members 
in posts of a screenshot. A group might do this versus posting via traditional methods if 
they believe the U.S. Government is searching text posts autonomously. Separately, if this 
agency were scanning collected hard drives, screenshots could require human inspection 
to determine if a drive’s owner was producing manuals or instruction sets for nefarious 
purposes, like bomb making. In contrast to marking a screenshot for review, this tool 
could also serve to filter out screenshots from further analysis by humans if an agency 
considered them noise. Lauren Sharpe [17] first published the results in this section in her 
NPS thesis in June of 2013. For her work, Sharpe [17] collected screenshot images from 
the Wikimedia Commons’ user-tagged Screenshots category [25]. This source produced 
a positive image set of 2,423 images. For negative examples, Sharpe [17] utilized the 13 
Natural Scene Categories dataset [26], containing scenes of furniture in homes, highways, 
buildings, coastlines and more. The total non-screenshot examples, or other category, 
amounted to 3,694 images. Her negative example set only contains grayscale images while 
her screenshot examples contain color images. Sharpe [17] deployed the OpenCV, Python 
Imaging Library (PIL) [27], and Orange [28] software libraries to implement her algorithms. 
She extracted multiple feature sets with OpenCV and PIL and fed them into a Naive 
Bayes classifier from the Orange library. She implemented ten-fold cross-validation across 
2,400 positive and 2,400 negative example images to evaluate her algorithm. Table 2.3 
depicts both the training and evaluation runtime statistics of Sharpe’s [17] best performing 
algorithm. She combines these times due to her cross-validation method. The four feature 
sets she manually selected and tested are: Line Segment Percentages (1), Line Segments 
Binned by Length (2), Intensity Entropy (3), and Intensity-Based Histograms (4). Table 2.4 
depicts the performance of Sharpe’s [17] algorithm testing 11 combinations of these feature 
sets. Also, Figure 2.4 depicts a ROC curve to display performance of the first feature set 
only. These are the metrics that will serve as a baseline for comparing performance to our 
algorithm implemented in TensorFlow. 
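
As a reminder of how ten-fold cross-validation partitions a dataset, the sketch below splits sample indices into ten folds, each serving once as the test set while the other nine train the classifier. This is a simplified illustration, not Sharpe's procedure from the Orange library; the helper `k_fold_indices` is hypothetical, and a practical version would shuffle (and usually stratify) the samples first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 4,800 images, matching the 2,400 positive and 2,400 negative examples above.
folds = list(k_fold_indices(4800, k=10))
```

Every image is tested exactly once, which explains why training and evaluation times are reported together for this experiment.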

Table 2.3. Training and evaluation runtime statistics from Sharpe’s [17] 
thesis. 

Algorithm      Images   VM Resources             Train & Test Time
All Features   4800     32-bit VM with 1GB RAM   14 minutes
All Features   1        32-bit VM with 1GB RAM   0.18 seconds

Sharpe [17] ran her algorithms in a virtual machine running on a Windows 7 system 
with 8GB RAM and an Intel Core i7-2600 processor. She did not provide further 
details than those listed in this table as to the total resources this virtual machine 
had access to from its host system (i.e., number of processor cores). 


Table 2.4. Sharpe’s [17] screenshot detector results for combinations of four 
feature sets. Source: [17, Table 4.9]. 

Set 1  Set 2  Set 3  Set 4  Line Bins  Color Bins  Accuracy  Precision  Recall  F-score
-      -      Yes    Yes    -          50          0.942     0.950      0.933   0.941
-      Yes    -      Yes    10         500         0.968     0.993      0.942   0.967
-      Yes    Yes    -      10         -           0.937     0.994      0.880   0.933
-      Yes    Yes    Yes    10         500         0.966     0.991      0.941   0.965
Yes    -      -      Yes    -          500         0.963     0.989      0.937   0.962
Yes    -      Yes    -      -          -           0.954     0.963      0.945   0.954
Yes    -      Yes    Yes    -          10          0.967     0.981      0.953   0.967
Yes    Yes    -      -      10         -           0.956     0.973      0.937   0.955
Yes    Yes    -      Yes    50         500         0.975     0.997      0.952   0.974
Yes    Yes    Yes    -      10         -           0.973     0.987      0.957   0.972
Yes    Yes    Yes    Yes    10         10          0.980     0.997      0.963   0.980


Sharpe [17] conducted experiments for the 11 combinations of her four feature 
sets in this table. For each combination containing the feature set Line Segments 
Binned by Length (2) or Intensity-Based Histograms (4) she varied the number of 
Line Bins and Color Bins, respectively. She tested five bin sizes (10, 50, 100, 500, 
and 1000) for each. The line bins grouped line segments by their length and the 
color bins grouped pixels in bins by pixel-value intensity. Sharpe [17] depicts only 
the best performing number of bins for each feature set combination in the table. 
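
For readers comparing the last three columns of Table 2.4, the sketch below computes an F-score as the harmonic mean of precision and recall. We assume Sharpe reports this balanced form of the metric; the source does not state the formula, although the first and last rows of the table are consistent with it. Other rows can differ in the final digit because the published precision and recall are themselves rounded.

```python
def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the first and last rows of Table 2.4.
first_row = round(f_score(0.950, 0.933), 3)
last_row = round(f_score(0.997, 0.963), 3)
```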
Figure 2.4. ROC curve for Sharpe’s [17] first feature set. This is the only 
ROC curve that Sharpe [17] provided in her report. Even though it is not 
for the best performing algorithm, which combined all four feature sets, we 
include it for comparison to our ROC curves. Each line in the plot represents 
results for one of ten iterations of the ten-fold cross-validation Sharpe [17] 
conducted for each algorithm. Source: [17, Figure 4.16]. 


2.3 Overview of CNNs and Modern Models 

This section describes the key ideas that make our experiments possible and the underlying 
technology behind the models we apply with TensorFlow. First, we present a swift 
introduction to the convolution operation, the convolutional layer, and the pooling layer as the 
building blocks for the CNNs from our experiments. The next section explores the concept 
of deep learning and its benefits when combined with convolutions in deep convolutional 
neural networks. A third section covers the importance of quality training data and some 
of the large modern datasets available. Fourth, Section 2.3.4 introduces and describes the 
specific models, already configured and available in TensorFlow, that we evaluate in the 
experiments detailed in Chapter 4. Finally, we discuss transfer learning and some of the 
key techniques available to improve performance of these models during training. A note 
to the reader: we will interchange the common terms network, model, and architecture to 
describe the algorithms. These sections serve as the academic foundation for our research, 
which for the most part TensorFlow’s application program interface (API) abstracts away, 
but these concepts remain essential to understanding what the software does. 

2.3.1 Convolutional Neural Networks 

The academic community has thoroughly documented artificial neural networks (ANNs) 
and their usefulness across a spectrum of data science problems and we will forgo an 
introduction in this paper. Instead, we focus on a specific type of ANN, known as a deep 
convolutional neural network, which sits at the core of most of the recent breakthroughs 
in computer vision related tasks [12]. The academic competition inspired by challenges 
like ImageNet, introduced in Chapter 1 and discussed later, fueled improved versions of 
CNNs [13], [29]-[31] with near human-level performance on complex image classification 
tasks [1]. Before we dive into details of these algorithms, we discuss the fundamental 
building block that makes a neural network convolutional: the convolution itself. We then 
give a brief summary of other CNN components before moving on to a discussion of more 
complex modern CNNs. 

The simple example in Figure 2.5 provides an intuition for the inner workings of a 
convolutional layer. Consider a vision algorithm designed to provide an autonomous system the 
ability to “see” a printed tic-tac-toe board. This ability will serve as a first step in building 
a complete system to compete with a human being. In our example image from the figure, 
each block in the image represents a pixel with a value of zero for white and one for black. 
Our algorithm will engage two convolutional kernels, or filters, that contain the patterns 
to match in portions of our image. By matching, we hope to identify key locations in the 
tic-tac-toe board. Through the convolution operation, which mathematically executes a 
dot product, a single output value gets produced for each window location in the image. 
Logically, we determine the spacing of the evaluation locations by starting in the upper left 
hand corner and repeatedly sliding the center of the window right by two pixels, a stride of 
two. We move the corner-detection filter (highlighted in blue) across the top three rows until 
reaching the right edge of the image and then start over again on the left side but two rows 
lower. This process continues until reaching the bottom right edge of the image. The space- 
detection filter at the bottom of the image (highlighted in red) mirrors the corner-detection 
filter for this illustration. Typically, convolutional layers calculate the outputs for all filters 
across an entire image simultaneously. They do so through efficient matrix multiplication 
and addition. The gold boxes in the second row of the figure show the calculations that 
take place at each location to determine the output. We omit multiplications resulting in 
zero because of an image or filter value of zero and thus only the pixels in the yellow boxes 
affect the output of the convolution at this window position. The final row of the figure 
depicts the locations where our two filters produce their maximal output. These locations 
correspond to the corners defining the board for the corner-detection filter and the possible 
play locations for the space-detection filter. Figure 2.5 presents only the basic principles of 
the convolution for a two-dimensional, one-channel example. An inquisitive reader can find 
an explanation of concepts like padding, convolution of multiple-channel filters across the 
red, green, and blue channels of an image, and weight initialization in resources such as 
Stanford’s online course on CNNs [32]. 
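
The sliding-window dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions: the 7x7 image of two crossing lines and the plus-shaped filter are hypothetical stand-ins, not the image or filters of Figure 2.5, and the function applies no padding.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the grid of dot products."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride
            # Dot product between the filter and the pixels under the window.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 7x7 image: one horizontal and one vertical black line (value 1).
image = [[1 if r == 3 or c == 3 else 0 for c in range(7)] for r in range(7)]
# Hypothetical filter that responds most strongly where two lines cross.
cross_filter = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
response = conv2d(image, cross_filter, stride=2)
```

With a stride of two the filter is evaluated at nine window positions, and the largest response appears at the center position, where the two lines intersect.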

Figure 2.5. Simple example to introduce the convolution. 

A few key concepts remain to complete an overview of CNNs (see Nielsen’s online textbook 
for a more thorough discussion [33]). First, the filter values, also known as weights, are one 
component a CNN updates during training. These networks essentially learn, by labeled 
example images, to recognize patterns from raw pixel values that indicate a certain type 
of object exists in the image. The initial convolutional layers possess multiple filters with 
learned weights that draw out meaningful features from groups of pixels across an image: 
edges, color patterns, corners, and more. Modern CNNs learn hundreds of different filters 
across their many layers, the first filters applied to the raw pixel input and later filters 
applied to the output values of previous convolutional layers. Each convolutional-layer 
output summarizes how strongly a learned pattern exists in a group of pixels or previous 
layer outputs. In our example from Figure 2.5, a single output value for each filter represents 
nine pixels at each evaluation location. The outputs of the convolution are run through an 
activation function which makes a determination if a pattern is prevalent enough to pass to 
the next layers. Second, the pooling layer is another key layer in CNNs [32]. These layers 
follow convolutional layers and reduce the output values passed to later layers. Max-pooling 
is a common pooling layer function and it simply takes a max of adjacent outputs from the 
convolutional layer, perhaps reducing four adjacent outputs to the largest output of the four. 
Convolutional and pooling layers then get repeated with several other innovative techniques, 
such as dropout [34], employed at certain layers. In this manner a CNN pushes only the 
most strongly detected feature information to later layers [2] to drive the algorithm towards 
a set of output neurons, each corresponding to a specific classification. Now that we have 
established a description of the convolution and the general construction of a CNN, we 
will proceed to an examination of what makes such networks “deep” and how this depth 
provides additional expressive capacity. 
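
The activation and pooling steps described above can be sketched as follows. We assume a rectified linear unit (ReLU) as the activation function, since the text does not name one, and non-overlapping 2x2 max-pooling; the feature map values are hypothetical.

```python
def relu(feature_map):
    """Pass positive responses through unchanged and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 output of a convolutional layer.
feature_map = [
    [1, -2, 0, 3],
    [4, 0, -1, 2],
    [-3, 1, 5, 0],
    [0, 2, 1, -4],
]
pooled = max_pool(relu(feature_map))
```

Here sixteen convolution outputs reduce to four, each the strongest response among four adjacent outputs.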

2.3.2 Deep Learning 

We start with a definition from an expert in the field of deep learning, Yoshua Bengio: 
“A deep learning algorithm is a particular kind of representation learning procedure that 
discovers multiple levels of representation, with higher-level features representing more 
abstract aspects of the data” [2]. By simply repeating combinations of the layers discussed 
in the previous section, a CNN becomes deep. Compared to previous machine learning 
algorithms, deep learning allows an automated discovery of higher-level features and 
abstractions from raw input data. Deep CNNs do not require manual extraction of features 
prior to feeding inputs into a neural network, as seen in the previous methods discussed 
in Section 2.2. These algorithms first learn low-level features, which typically look like 
simple edge and pattern detectors tailored to the training data [35]. These first features, 
closer to individual pixels than humans can inspect or identify efficiently, allow a neural 
network to then learn larger abstractions through combinations of the features. Referring 
back to Figure 2.5 and our tic-tac-toe example, a next convolutional layer may combine the 
abstractions of a playable space identified by the space-detection filter, surrounded at all four 
corners by intersecting lines identified from the corner-detection filter, as a new 
abstraction of the center playable space of the tic-tac-toe board. Expanding this simple concept 
of combination across many more layers, certain types of deep learning neural networks 
become surprisingly efficient at classification requiring a high level of abstraction [2]. For 
this reason, deep learning algorithms serve as the base for the significant improvements of 
artificial intelligence applications in the field of computer vision occurring rapidly in only 
the past few years [2]. 

With a brief introduction behind us, we will explore the unique benefits and challenges 
of employing deep architectures. The benefit, as cited in Chapter 1, is the reduction of 
state-of-the-art error rates by 30% to 50% on complex benchmark competitions [2], like 
ImageNet. Through this competition, the importance of network depth emerges as all recent 
top performing models possessed depths between sixteen and thirty layers [31]. The challenge 
lies in the fact that as networks get deeper they are more difficult to train [31]. More layers 
mandate more trainable parameters. As a result, deep CNNs depend upon thousands of 
labeled examples to train to convergence without overfitting [29]. Along with a need for more 
data, deep networks take longer to train and require specialized hardware to do so efficiently. 
Specifically, a programmer can now train deep learning algorithms across multiple GPUs, 
instead of a single GPU or many traditional CPUs, to speed up training by several orders of 
magnitude [2]. These hardware advances enable researchers to train hundreds of iterations 
of their algorithms, adjusting key hyperparameters like learning rate, to achieve optimal 
performance and reap the benefits of higher levels of abstraction [29]. Still, the complexity 
of modern architectures can lead to training times measured in days to weeks [30] to obtain 
the performance of the algorithms winning classification competitions. In summary, while 
the performance benefits impress, training deep algorithms proves difficult and limitations 
exist to simply continuing to increase the depth to further improve performance [31]. 

2.3.3 The Importance of Data and Publicly Available Image Sets 

With increases in the complexity of deep CNN architectures, the importance of quantity 
and quality of training examples for learning also increases. Some experts even go as far 
as making the argument: “As a rule of thumb, a dumb algorithm with lots and lots of data 
beats a clever one with modest amounts of it” [36]. This section introduces techniques 
for augmenting smaller datasets and some publicly available image sets to serve as a 
starting point for training and testing algorithms. Several methods exist that reduce the 
number of ground truth examples necessary for convergence: smarter weight initialization; 
training first with artificial data [37]; and transfer learning [38], a method for repurposing 
already trained networks. Of these methods we specifically benefit from transfer learning, 
which we will examine in Section 2.3.5. Still, there remains a need for a large number 
of quality training examples when implementing deep CNNs. There must be enough 
labeled data to first train reliable low-level feature extractors and also build the mid-to-high 
levels of abstractions necessary to benefit from deep architectures. Furthermore, it can be 
difficult [39], but not impossible [35], to know what features a trained neural network is 
learning in order to classify an image. With too few examples, all training images could 
possess a similar trait (like the same background), unrelated to the object for detection, 
that an algorithm learns as an indicator of the object. Thus, if we want to be sure that the 
network learns the object we intend, our datasets must possess sufficient quality examples: 
that is, enough instances of an object in a variety of realistic settings. We must show an 
algorithm examples of that object in different lighting conditions, from different angles, and 
in the type of scenes in which we want to be sure our algorithm works. Fortunately, there 
are large and diverse datasets available to help overcome this challenge. Table 2.5 presents 
some of the largest available to aid the DOD in machine learning and image classification 
related research; our research specifically benefited from ImageNet. 

A final dataset-related concept is the trend of complex networks to perform image 
preprocessing during training to artificially increase the size of the datasets in hopes of reducing 
overfitting [29]. Since the networks usually require a constant input size, some combination 
of scaling and cropping is necessary to make images fit in each network due to the variety 
of sizes and resolutions present throughout datasets [29]. Further, to ensure brightness 
invariance, networks may normalize pixel brightness across a batch of images or randomly 
adjust it to ensure that illumination of an object in an image does not affect its classification. 

Table 2.5. A sample of online image sets for training deep networks. 

Dataset            # Categories   Images      DOD Related Labels
ImageNet [1]       1,000          1,331,167   rifle, assault rifle, tank, bulletproof vest, and more
COCO [40]          80             300,000     airplane, car, truck, person
Open Images [41]   6,000          9 Million   warship, tarmac, rifle, and more


This table lists some of the labels from each set that apply to the DOD. ImageNet 
is the most recognized and organized of the datasets; however, it does not provide 
licenses for republishing the images. The Microsoft-sponsored COCO is the most 
distinctive: it offers five captions for each of its images that attempt to explain how 
objects in the image are interacting. Open Images, created by Google, is a library 
of URLs to labeled image data that strives to contain only Creative Commons 
licensed images, lifting restrictions on republishing and reusing the images. 


Other techniques involving random distortions of the training images, like cropping, 
flipping from left to right, or scaling, can artificially increase the number of training examples 
by creating multiple unique observations from a single image [42]. Preprocessing methods 
vary by model and TensorFlow implements those prescribed by the model’s designers while 
allowing a user to adjust the preprocessing steps as desired. 
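
The random distortions described above can be sketched with nested lists standing in for image pixels. This is an illustration of the idea only and is not TensorFlow's preprocessing code; the helper names, the 6x6 image, and the 4x4 crop size are hypothetical.

```python
import random


def random_crop(image, height, width, rng):
    """Cut a height x width window from a random position in the image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def flip_left_right(image):
    """Mirror the image horizontally."""
    return [row[::-1] for row in image]


def augment(image, height, width, rng):
    """Produce one randomly cropped and possibly flipped training example."""
    crop = random_crop(image, height, width, rng)
    if rng.random() < 0.5:
        crop = flip_left_right(crop)
    return crop


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[10 * r + c for c in range(6)] for r in range(6)]
samples = [augment(image, 4, 4, rng) for _ in range(5)]
```

Each call yields a fixed-size example drawn from the same source image, which is how a single labeled image can contribute several distinct observations during training.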


2.3.4 Modern Models and Performance Improvements 

The trend of researchers to open source their image classification algorithms [13], [29]-[31] 
makes them available for others to apply to new problems with little effort. Even 
more convenient, a number of software libraries quickly implement the newest algorithms. 
This section introduces three successful deep CNNs, provides a brief summary of their key 
contributions, and performs a high level comparison of the algorithms. Experts apply many 
innovative techniques across the field of deep CNNs that the creators of these algorithms 
cover in cited papers but we omit from our discussion for the sake of brevity. 

The first algorithm, trained by Oxford’s Visual Geometry Group (VGG) and thus known 
as VGG Network (VGGNet) [30], secured a first and second place position in the ILSVRC 
2014. There are three different versions of this network available for testing in many open- 
source libraries, varying the number of convolutional layers to provide network depths of 
11, 16, and 19 while keeping other parameters the same. VGGNet’s contribution came by 
showing that a network with very small convolution filters, three by three pixels as compared 
to other networks with larger five by five or seven by seven filters [42], could achieve state of 
the art performance by simply increasing network depth. This depth increase was possible 
in the network due to the decreased number of trainable parameters required by the smaller 
convolutions [30]. Figure 2.6 depicts the largest VGGNet architecture in a side-by-side 
comparison with Microsoft’s ResNet model and Table 2.6 presents a high level comparison 
of the network to other recent models. 
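
The parameter savings from small filters can be checked with simple arithmetic. Following the argument in the VGGNet paper [30], a stack of three 3x3 convolutional layers covers the same 7x7 region of its input as a single 7x7 layer. The sketch below counts the weights in each option; the channel width of 256 is an assumption chosen only for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, channels, layers=1):
    """Weights in `layers` stacked conv layers with `channels` inputs and outputs."""
    return layers * kernel_size * kernel_size * channels * channels


channels = 256  # illustrative width, not a value taken from the text
three_by_three_stack = conv_weights(3, channels, layers=3)
single_seven_by_seven = conv_weights(7, channels)
savings = 1 - three_by_three_stack / single_seven_by_seven
```

The stack needs 27 weights per channel pair against 49 for the single large filter, a reduction of roughly 45 percent, while adding two extra layers of depth.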

Figure 2.6. VGGNet and ResNet architectures. We recreate this figure from 
He et al. [31, Figure 3]. 

Microsoft’s ResNet [31], which took first place in the 2015 competitions for both ILSVRC 
and COCO, is a second algorithm of interest. ResNet introduced the latest significant 
improvement for state-of-the-art CNNs [32] through its employment of residual layers. In 
seeking to develop deeper networks, and overcome some of the challenges mentioned in 
Section 2.3.2, Microsoft’s research team sought to simplify the training process. The team 
created residual layers which they discuss in detail in their paper [31]. Here, we will only 
attempt to establish an intuition of the benefit from the residual layers. These layers simply 
forward the inputs of previous layers ahead two layers, as depicted by the arrows on the right 
of Figure 2.6, along with new information pulled out from the previous convolutional layers. 
In this manner a convolutional layer has all of the information provided to previous layers 
available to it and only has to extract additional information, or a residual mapping, if it 
improves classification performance. This affords a network architect the ability to increase 
the number of layers and know that a network will converge on the optimal number. Residual 
layers ensure this by allowing a layer to forgo extracting additional information and to simply 
pass the output of previous layers forward [31]. This method introduces efficiencies that 
lead to significantly shorter training times [13], [31]. Table 2.6 includes details of two 
ResNet models for comparison. 
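
The intuition behind a residual layer can be sketched as an addition of the block's input to whatever the intervening layers compute. In this illustration the functions `zero_residual` and `small_residual` are hypothetical stand-ins for the pair of convolutional layers inside a block; they are not part of the ResNet implementation.

```python
def residual_block(x, residual_fn):
    """Return x + F(x): the forwarded input plus the learned residual."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    """A block that has learned to add nothing."""
    return [0.0 for _ in x]


def small_residual(x):
    """A block that has learned a small correction to its input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
passthrough = residual_block(x, zero_residual)
adjusted = residual_block(x, small_residual)
```

When the residual is zero the block simply passes its input forward, which is the behavior that lets an architect add layers without forcing each one to extract new information.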

Google’s Inception networks are the final model we introduce. The first version of the 
network, also known as GoogLeNet, won one of the ILSVRC 2014 challenges [42]. Since 
this competition, Google released several iterations of the network; we will focus on two. 
The first is Inception-v3 [12], whose computational efficiency makes it suitable for use in 
resource-constrained environments while still delivering state-of-the-art performance. The 
second is Google’s latest iteration of the network, Inception-ResNet-v2 [13], which incorporates 
Microsoft’s residual layers to further reduce classification error rates. Figures 2.7 and 2.8 
depict the architectures for these networks and Table 2.6 provides their key performance 
specifications. 

Figure 2.7. Illustration of the Inception-v3 architecture. From the left side 
of the figure, Inception-v3 combines convolutions and max-pooling layers 
to quickly pull out relevant patterns from the pixel values to reduce the 
299x299x3 input images to a 35x35x192 output after the second max-pooling 
layer. This smaller output represents the low-level features pulled from the 
image. The Inception-v3 model then feeds the smaller output through multiple 
paths of what Szegedy et al. [12] call Inception modules, to detect 
higher level abstractions from the features. The salmon-colored concatenation 
nodes in the diagram represent a stacking of the outputs from all of 
the Inception module paths to provide them to the next Inception module. 
After repeating several variants of these modules, the Inception-v3 model 
feeds the output to a fully-connected layer (second node from the right) 
with 1,001 neurons representing the 1,000 ImageNet categories along with 
a background category. Source: [43]. 

Figure 2.8. Illustration of the Inception-ResNet-v2 architecture. The 
Inception-ResNet-v2 model contains a similar architecture to the Inception-v3 
model discussed in Figure 2.7. Two key differences are the fewer convolutions 
in each Inception module (defined in Figure 2.7) and the significantly 
more repetitions of these modules to allow a greater network depth. This 
figure depicts the residual layer connections (introduced by He et al. [31]) 
as purple circles after three variants of Inception modules. These residual 
connections facilitate the blocks of 10 or 20 repetitions (depicted in the 
color boxes) for the three modules. The first six and last five nodes of the 
Inception-ResNet-v2 architecture match Inception-v3. Source: [43]. 


The models presented in this section are complex, but their success at solving challenging 
image classification tasks makes them worth studying. Teams of experts in the field of 
machine learning developed each of these algorithms. Few individuals, or even teams, 
could independently create and employ the architectures and methods necessary to achieve 
similar cutting-edge performance. Fortunately, since these architectures are made available 
publicly by their creators, and then implemented in software through open-source libraries, 
groups that could not produce them independently have the ability to employ them. 

2.3.5 Training Methods 

With a discussion of state-of-the-art CNNs’ performance on the ImageNet dataset completed,
in this section we will look to applying them to organization-specific image classification
tasks. As discussed earlier, these networks require large amounts of quality training
data to be successful. If an organization has enough labeled training data available, it

Table 2.6. A top-level overview of three deep CNNs.

Algorithm             Top-1 Err. (%)   Top-5 Err. (%)   Parameters   FLOPs
VGGNet 19             24.7             7.1              144M         19.6G [31]
ResNet 34             21.53            5.6              ~50M         3.6G
ResNet 152            19.38            4.49             -            11.3G
Inception-v3          19.47            4.48             25M          5.0G
Inception-ResNet-v2   18.7             4.1              -            -

All comparisons are for a single model for top-1 and top-5 classification error
rates from the ImageNet 2012 validation dataset. FLOPs refer to the number
of multiply-add operations for evaluating an image in the network. All numbers
are from the papers [12], [13], [30], [31] released in conjunction with the ILSVRC
challenge, or later upon updates of the architectures, unless otherwise cited.
Microsoft did not provide the parameter counts for its ResNet 34 and ResNet 152
networks. The ResNet 34 parameter count is an estimate we calculate from
Figure 2.6. VGGNet's three fully-connected layers at the end of its network account
for about 100M of its parameters. The other networks only apply a single
fully-connected layer; thus, they contain significantly fewer parameters. VGGNet and
ResNet implemented an average of ten different crops across an image during
evaluation whereas the Inception networks average twelve. Empty locations in the
table are for figures not provided by model creators and not easily calculated due
to model complexity.


is possible to train models from scratch on new classification categories. Realistically,
acquiring and labeling a sufficient volume of training data for each new problem proves
inefficient and severely limits the tasks for which users can apply these powerful models.
Thankfully, another training method called transfer learning [38] exists. This method
capitalizes on the models’ several levels of abstraction, as discussed in Section 2.3.2. Since
the lowest-level convolutional layers learn to extract features like edges and color patterns
that apply to any category of classification, users do not need to retrain these layers for each
new task. Similarly, as noted in [38], the middle and higher levels of abstraction also prove
useful in most tasks for networks trained on diverse classification problems like ILSVRC.
Thus, it is possible to simply replace the last level of abstraction, the categories themselves.
Starting with a network trained on a large image set like ImageNet, one can simply retrain
the last fully-connected layer of the network. See the darker purple oval second from the right
of Figure 2.8 to gain an appreciation for all of the layers to the left of this purple oval that
we reuse with transfer learning. With this method, it is possible to reduce training time
from days to minutes, and the number of images required from tens of thousands to only
hundreds, while still achieving state-of-the-art performance.
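
To make the idea concrete, the following is a minimal sketch of retraining only a final fully-connected softmax layer on top of fixed features. It is an illustration, not the script we use in our experiments: the random `features` array stands in for the output of the frozen pretrained layers, and the feature length of 64, the synthetic labels, the learning rate, and the function name `train_last_layer` are all assumptions chosen for the example.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_last_layer(features, labels, num_classes, steps=500, lr=0.1):
    """Fit one fully-connected softmax layer with gradient descent.

    The features are treated as constants, mimicking frozen lower layers;
    only the new layer's weights and biases are updated.
    """
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        probs = softmax(features @ w + b)
        grad = (probs - one_hot) / n          # gradient of mean cross-entropy w.r.t. logits
        w -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


# Synthetic stand-in for frozen-layer outputs (assumption: 200 images, 64 features).
rng = np.random.default_rng(0)
labels = np.array([0] * 100 + [1] * 100)
features = rng.normal(size=(200, 64))
# Shift the first eight feature dimensions by class so the two classes are separable.
features[:, :8] += np.where(labels[:, None] == 1, 1.0, -1.0)

w, b = train_last_layer(features, labels, num_classes=2)
train_accuracy = (softmax(features @ w + b).argmax(axis=1) == labels).mean()
```

Only `w` and `b` change during training, which is why so few labeled images and so little compute can suffice when the reused features are already informative.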

With an understanding of supportive policy for acquiring TensorFlow in the DOD, previous
NPS experiments and image sets that serve as potential application areas, and state-of-the-art
models and training methods available through the software library, it is now possible
to develop experiments to provide empirical evidence of TensorFlow’s suitability for
DOD-specific applications.

CHAPTER 3:
Methodology 


This chapter presents the logic behind our experimental design as we pursue answers to our 
research questions. Through our experiments we explore Google’s TensorFlow software 
framework and evaluate its usefulness across the DOD. To accomplish this, we compare 
the performance of algorithms implemented with the software to three previously employed 
machine learning methods and experiments conducted at NPS. Here, we first provide
the scope of our research since we do not test every potential application of TensorFlow. 
Then we discuss the reasoning behind our design of three experiments within our scope. 
Overall, this chapter justifies our experimental design and our selection of applications of 
the software we test. 


3.1 Scope 

TensorFlow is a significant software library with more functions and applications than we 
are able to test. Our evaluation of TensorFlow focuses on object detection as a starting 
point to make recommendations for future DOD research towards the implementation of 
open-source software for other machine learning tasks. To further scope the problem, this 
study will not: 

• Conduct an analysis of the other available open-source frameworks 

• Implement new algorithms for image classification 

• Complete testing of these algorithms in operational environments 

• Test all TensorFlow use cases (e.g., speech and text recognition and prediction)

The purpose remains to provide a logical and empirically based recommendation for the 
DOD to pursue open-source deep learning software solutions in the future. In setting out 
to achieve this end and answer our research questions, three lines of effort for applying 
TensorFlow arise within our scope: employing TensorFlow-based algorithms without mod¬ 
ification when an existing category applies to a current object; repurposing such algorithms 
for DOD-specific use cases through transfer learning when 50 or more training images 
exist for an object; and training proven algorithms from scratch to build new object
detectors when many thousands of training examples are available. We select these applications
of TensorFlow to provide a sufficient survey of the software’s utility with respect to DOD
requirements. As described in detail in upcoming sections, we find reasons for applying one of
these lines of effort to each image set from the three NPS theses introduced in Chapter 2.

3.2 Employing TensorFlow-Based Algorithms without 
Modification 

The trained Inception models available through TensorFlow already afford cutting-edge 
classification performance on the 1,000 categories from ILSVRC 2012. We sought to apply 
these algorithms without modification to a DOD media analysis problem. Jones’ [15] AK- 
47 image set met the necessary criteria as two categories from the ILSVRC 2012 training 
set appeared useful for detecting these weapons in images: the rifle and the assault rifle, 
assault gun (AR) categories. The presence of these relevant categories allows us to simply 
run Jones’ [15] test set images through the algorithms to establish their ability to identify 
weapons, like AK-47s, in intelligence-related images. Since TensorFlow provides access 
to multiple Inception models, each one reportedly improving performance on the ILSVRC 
2012 validation set [12], [13], we choose to repeat the evaluation for each version. This 
repetition captures a snapshot of Google’s improvement of the algorithm as an AK-47 
detector over the five iterations. After seeing high performance on Jones’ [15] test set, 
we then decided to conduct tests of the best performing model on a collection of more 
difficult AK-47 images and negative examples to further stretch the algorithm. Finally, 
one of our research questions asks how deep CNNs implemented in TensorFlow perform 
in resource-constrained environments. In pursuit of this answer we chose to evaluate the 
five Inception models with an ideal allocation of compute resources and also with limited 
resources comparable to those available in military-deployed environments. The overall 
purpose of this experiment is to prove the power of open-source algorithms available through 
TensorFlow without requiring modification and to show the promise of future performance 
improvements based on past revisions. 

3.3 Repurposing Available Algorithms with Transfer
Learning 

For a second image set, the collection of ship images from Camp [16], we repurpose a 
single Inception model using transfer learning to build a binary classifier for ship detection. 
The major factor driving our decision to apply transfer learning to Camp’s [16] image set 
was the limited number of training examples in the dataset. Also, ship-related categories
in the ILSVRC 2012 image set give us confidence that the previously learned features
are relevant for distinguishing Camp’s [16] ship or no ship categories. We elect to retrain
a single Inception model for the sake of simplicity. For this experiment, we select the
Inception-v3 model due to its balance of performance and computational efficiency
(see Table 4.3). Since the related operational scenario for a ship detector in a surface-based
UAS would likely mandate constrained compute resources, the selection of Inception-v3 
is logical. Further, Google provides a tutorial and script, cited later in Section 4.2.4, that 
simplifies the process of conducting the experiment. In completing this second experiment, 
we strive to prove the applicability of TensorFlow-based object detectors for problems with 
limited training examples and in resource-constrained systems. 

3.4 Training a Deep CNN from Scratch 

The third and final experiment trains three publicly released deep CNNs from scratch and 
tests their performance in identifying screenshots from Sharpe’s [17] image set. Since deep 
algorithms contain significantly more learnable parameters than other machine learning 
algorithms, we understood that our three previous image sets did not contain enough 
examples for both training and evaluation. Hence, Sharpe’s [17] image set is the focus of 
the third test of TensorFlow as her subject, screenshots, proves the simplest of the three 
to acquire. We had to assemble a new training set for this experiment. By doing so, we 
explore TensorFlow’s utility in applying a model tailored down to the lowest-level-feature 
extractors to meet defense-specific media analysis needs. The goal of our final experiment
is to prove that when an organization in the DOD possesses a substantial labeled image set,
it can train a deep CNN from scratch with open-source software and algorithms.

This concludes the logic behind our experimental design. Chapter 4 focuses on providing 
specific details and processes we execute to perform these three experiments. 

CHAPTER 4:
Experiments 


This chapter presents the details and processes for employing deep CNNs in TensorFlow 
on the three previous image sets introduced in Chapter 2. For each experiment, we discuss 
the image sets and steps for training and evaluating the models along with the performance 
measures applied to produce our results for Chapter 5. Throughout this chapter, it is 
worth remembering the goal of our experiments: to provide an empirical assessment 
of TensorFlow’s applicability in the DOD through completing experiments that simulate 
potential use cases. 

4.1 AK-47 Detection 

This section describes our experiment applying the five Inception models available in 
TensorFlow to Jones’ [15] and other AK-47 test sets. We first describe details for all of 
the image sets tested. Then, we cover the training of the Inception models conducted 
by Google researchers and describe our process for evaluating the models. Finally, we 
introduce our performance metrics for all of our experiments and compare them to specific 
metrics developed by Jones [15]. 

4.1.1 AK-47 Related Image Sets 

As a starting point, we introduce the images employed for training and evaluating the 
Inception models (see Table 4.1). Google trains these algorithms to classify images into 
1,000 categories from the ILSVRC 2012 training set. In order to determine if an AK- 
47 exists in an image we reference the AR category and the rifle category. The table 
combines the remaining 998 other categories Google employed to train the Inception 
models. We evaluate the models’ performance on three sources of test images. The first 
is from Jones’ [15] experiment. The second, AK-47 Internet Images, is a collection of
AK-47 images from the NPS Vision Lab database in dynamic settings, like urban combat
environments. The other category from this set contains images with groups of people in
congested environments: in living rooms, large groups posing for photographs, and crowds.
For the third test set, Rifle-Like Images, we gathered 100 images from the internet with people
carrying objects like paintball guns, lawn and power tools, and sports equipment. We chose
these images to ensure the algorithm did not show bias towards depictions of people holding
everyday objects. We include the last two image sets for testing in an attempt to stretch the
algorithm further than the more generic video stills provided by Jones [15].


Table 4.1. Training and test sets for the AK-47 experiment.

Training Sets           Category                          Number of Images
ImageNet [1]            assault rifle, assault gun (AR)   1,172
                        rifle                             1,475
                        other categories                  1,278,520
Test Sets
Jones [15]              AK-47                             687
                        other                             7,045
AK-47 Internet Images   AK-47                             107
                        other                             143
Rifle-Like Images       other                             100


4.1.2 Inception Training 

For this experiment we rely on training conducted and documented by Google researchers
for the following Inception models: v1 [42]; v2 and v3 [12]; and v4 and ResNet v2 [13]. We
conduct no further training or manipulation of these models for reasons stated previously.
The trained models, implemented in the TensorFlow-Slim [44] library, are available for
download [45].

4.1.3 Inception Models Evaluation 

After downloading the pretrained models, we modify scripts provided from TensorFlow
tutorials [46] to conduct our experiment. The script originally took a single image as
input and produced the top-five classification scores. Our modifications allowed us to run
all test images from a file directory through the models one image at a time and save all
1,000 classification scores. We ran the experiments with the Python 2.7.12 interpreter and
TensorFlow version 0.11.0rc2. We repeat the following process for each of the five Inception
versions. The software first reads an image from a file and decodes it into a bytecode format
that the Inception models require for input. It also conducts a bilinear rescaling and then
a central crop on the image to reduce it to the required 299 by 299 pixel input size. After 
preprocessing, we feed the image through the model while keeping scores from the three 
test sets separate. We then save the model’s output per image: a score for each of the 1,000 
ILSVRC 2012 categories (plus a background category score that we ignore) with the sum 
of the scores equaling one. Think of these scores as a probability of each object existing in 
an image according to the model. 
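
The two numerical steps in this process, the central crop and the conversion of raw model outputs into scores that sum to one, can be sketched as follows. This is an illustrative sketch rather than the modified tutorial script: the random `logits` vector stands in for a model's raw output, the background score is assumed to sit at index 0, and `AR_INDEX` and `RIFLE_INDEX` are placeholder positions that a user would look up in the model's label file.

```python
import numpy as np


def central_crop(image, size):
    """Keep the size x size window centered in an H x W x C image array."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


def softmax(logits):
    """Convert raw outputs into non-negative scores that sum to one."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


BACKGROUND_INDEX = 0   # assumption: background is the first of the 1,001 outputs
AR_INDEX = 414         # placeholder position for the AR category (assumption)
RIFLE_INDEX = 765      # placeholder position for the rifle category (assumption)

# Stand-in for an already rescaled image and for the model's 1,001 raw outputs.
rng = np.random.default_rng(0)
image = rng.random((342, 400, 3))
cropped = central_crop(image, 299)

logits = rng.normal(size=1001)
scores = softmax(logits)
category_scores = np.delete(scores, BACKGROUND_INDEX)   # the 1,000 scores we keep
ar_score = scores[AR_INDEX]
rifle_score = scores[RIFLE_INDEX]
```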

We repeat our evaluations of the test images twice to provide time estimates for classifying 
images with ideal hardware and in a resource-constrained environment. Table 4.2 displays 
the hardware devices and specifications employed throughout all three of our experiments. 
Using the TITAN Black GPU, we provide times for processing the AK-47 test set as single 
images through the Inception models in Table 4.3. For the constrained tests, we engage 
a portion of the i7-5820K CPU’s resources through a virtualized Docker [47] container 
with only two cores of the processor and 4GB of RAM. The times represented in the table 
measure the entire process to accomplish the image classification for all of the three test 
sets: first the software restores the trained model version to memory; then, it runs a single 
image at a time through the model, including preprocessing; next, the script saves image 
scores in a comma-separated value file by directory; and finally it removes the model from 
memory. The table also presents the average time per image by dividing the total time by 
the number of images in all three test sets. While not a measure of the time it takes to feed a 
single image through these models, our measurement provides an estimate for how long it 
will take for the algorithm to classify a group of images and present them for input to other 
software modules or store them in a database. There are possible time savings, which we 
will present in Section 4.3, if users store the images in a format that allows the model to 
process them in groups called batches, typically ranging from sizes of 30 to 100 images. 
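
As a worked example of how the per-image figures in Table 4.3 follow from the total times, the short sketch below divides the total time by the 8,082 test images and inverts the ratio for throughput. The function name `timing_summary` is our own label for this arithmetic; the two total times are the Inception v1 rows of the table.

```python
def timing_summary(total_seconds, num_images):
    """Return (seconds per image, images per second) for one full evaluation run."""
    seconds_per_image = total_seconds / num_images
    images_per_second = num_images / total_seconds
    return seconds_per_image, images_per_second


NUM_IMAGES = 8082  # images in all three AK-47 test sets combined

gpu_v1 = timing_summary(141.7, NUM_IMAGES)   # about 0.0175 s per image, 57.0 images per second
cpu_v1 = timing_summary(968.1, NUM_IMAGES)   # about 0.120 s per image, 8.3 images per second
```

Because the total includes restoring the model and writing the score files, these values describe end-to-end throughput rather than the time of a single forward pass.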

4.1.4 Performance Metrics 

Having gathered the raw output scores by category for each of the models, we compare them 
to the true labels of the test images to grade the classification performance of the Inception 
versions. First, we separate the AR category score and the rifle category score from the raw 
output scores. Then, we consider individually the AR score and the rifle score. To generate 
a third score for evaluation, we add the previous two category scores together. Finally, we 

Table 4.2. Hardware employed for experiments.

GPUs          Clock Speed   Cores   Memory (GB)   Memory Interface
TITAN Black   980 MHz       2,880   6.144         GDDR5
1080 Ti       1.582 GHz     3,584   11            GDDR5X
CPUs
i7-5820K      3.30 GHz      12      32            DDR4-2400
i5-3317U      1.7 GHz       2       4             DDR3


Both GPUs are from the NVIDIA GeForce GTX series. The TITAN Black resides 
in a compute node on the NPS Hamming Supercomputer. The 1080 Ti resides 
in a personal desktop computer. Both CPUs are from the Intel Core series. The 
i7-5820K is in the same desktop computer as the 1080 Ti GPU. The i5-3317U 
exists in a laptop computer and provides for an evaluation of TensorFlow in a 
resource-constrained environment. 


Table 4.3. Inception models’ image classification times.

Inception Version   Total Time   Time per Image   Images / sec
GPU
v1                  141.7        0.0175           57.0
v2                  174.2        0.0216           46.4
v3                  276.6        0.0342           29.2
v4                  485.4        0.0601           16.6
ResNet v2           565.7        0.0700           14.3
CPU
v1                  968.1        0.120            8.3
v2                  1128.8       0.140            7.2
v3                  2979.9       0.369            2.7
v4                  5711.0       0.707            1.4
ResNet v2           6071.9       0.751            1.3


All times are in seconds. We ran the image evaluation script on both the TITAN 
Black GPU and a portion of the i7-5820K CPU's resources through a virtualized 
Docker [47] container with only two cores of the processor and 4GB of RAM. See 
Table 4.2 for hardware details. We ran a total of 8,082 images through each model 
to generate the time values displayed in this table. 


compare these three scores to a sliding threshold of 1,000 equally spaced values between
zero and one. At each threshold, a score above the threshold value indicates a positive
classification by the model. We then count the true positives, false positives, and true negatives for our
labeled test images and use these counts to develop our metrics for the 1,000 thresholds.
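
The threshold sweep described above can be sketched in a few lines of NumPy. This is a simplified illustration and not our evaluation script: the four toy scores and labels are invented for the example, and the function name `sweep_thresholds` is hypothetical.

```python
import numpy as np


def sweep_thresholds(scores, labels, num_thresholds=1000):
    """Count TP, FP, and TN at equally spaced thresholds between zero and one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    # predicted[i, j] is True when image j scores above threshold i.
    predicted = scores[None, :] > thresholds[:, None]
    tp = (predicted & labels).sum(axis=1)
    fp = (predicted & ~labels).sum(axis=1)
    tn = (~predicted & ~labels).sum(axis=1)
    return thresholds, tp, fp, tn


# Toy example (invented values): two AK-47 images and two "other" images.
ar_scores = np.array([0.70, 0.05, 0.20, 0.01])
rifle_scores = np.array([0.20, 0.02, 0.50, 0.03])
combined_scores = ar_scores + rifle_scores      # the third score we evaluate
labels = np.array([1, 0, 1, 0])                 # 1 = AK-47 present

thresholds, tp, fp, tn = sweep_thresholds(combined_scores, labels)
```

The same function applies unchanged to the AR score alone or the rifle score alone, giving one set of counts per score type.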

From these counts, we calculate the recall, FPR, and F-score as defined in Equations 4.1,
4.2, and 4.4. We derive precision (defined in Equation 4.3) as a necessary intermediate step
to provide the F-score. As shown in Equation 4.4, the F-score metric is the harmonic mean
of both precision and recall which we provide to identify an optimal threshold for classifier
performance. We set β equal to one to attribute equal importance to both precision and
recall. Finally, we define accuracy here for completeness but only calculate it to compare
model performance during training for the final two experiments. Plotting the recall versus
FPR values at all of the 1,000 thresholds for the three different scores produces our ROC
curves presented in Chapter 5.

We employ the following abbreviations in the equations in this section: 

• TP = True positives 

• TN = True negatives 

• FP = False positives

• PE = Total number of positive examples 

• NE = Total number of negative examples 

Equations for recall, FPR, accuracy, and precision [48] and F-score [49]:

\[ \text{recall} = \frac{TP}{PE} \tag{4.1} \]

\[ \text{FPR} = \frac{FP}{NE} \tag{4.2} \]

\[ \text{precision} = \frac{TP}{TP + FP} \tag{4.3} \]

\[ \text{F-score} = \frac{(\beta^2 + 1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{4.4} \]

\[ \text{accuracy} = \frac{TP + TN}{PE + NE} \tag{4.5} \]
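
For readers who prefer code to notation, the five metrics translate directly into the small functions below. The counts in the worked example (100 positive and 400 negative images) are hypothetical values chosen to make the arithmetic easy to follow; they are not results from our experiments.

```python
def recall(tp, pe):
    """Fraction of positive examples the classifier found (Equation 4.1)."""
    return tp / pe


def false_positive_rate(fp, ne):
    """Fraction of negative examples incorrectly flagged (Equation 4.2)."""
    return fp / ne


def precision(tp, fp):
    """Fraction of positive classifications that are correct (Equation 4.3)."""
    return tp / (tp + fp)


def f_score(prec, rec, beta=1.0):
    """Weighted harmonic mean of precision and recall (Equation 4.4)."""
    return (beta ** 2 + 1) * prec * rec / (beta ** 2 * prec + rec)


def accuracy(tp, tn, pe, ne):
    """Fraction of all examples classified correctly (Equation 4.5)."""
    return (tp + tn) / (pe + ne)


# Hypothetical counts at one threshold.
PE, NE = 100, 400
TP, FP, TN = 80, 20, 380

rec = recall(TP, PE)                     # 80 / 100
fpr = false_positive_rate(FP, NE)        # 20 / 400
prec = precision(TP, FP)                 # 80 / (80 + 20)
f1 = f_score(prec, rec, beta=1.0)
acc = accuracy(TP, TN, PE, NE)           # (80 + 380) / 500
```

Precision is undefined at thresholds where the classifier makes no positive classifications (TP + FP = 0), so a sweep over all thresholds needs to skip or specially handle that case.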


4.1.5 AK-47 Experiment-Specific Performance Metrics 

We must note that our results do not directly compare to Jones’ [15] results. While we both 
present ROC curves as a performance measure, Jones’ [15] algorithm evaluates an image 
several times before providing a count of AK-47s in the image instead of an overall image 
classification. Thus, his definitions of recall and FPR, reproduced in Equations 4.6 and 4.7,
vary from the standard definitions. Still, we present a general comparison of performance
in Chapter 5. 

Equations from Jones’ [15] thesis: 


\[ \text{recall} = \frac{\text{NumberOfWeaponsDetectedInSet}}{\text{TotalWeaponsInSet}} \tag{4.6} \]

\[ \text{FPR} = \frac{\text{NumberOfFalsePositivesInImageSet}}{\text{TotalAreasChecked}} \tag{4.7} \]


4.2 Ship Detection 

The second experiment sets out to apply deep CNNs to a DOD image analysis task, ship 
detection, even though few training images are available. This section describes the image 
sets, image distortions, optimizations to reduce training time, and the training process for 
reapplying the pretrained Inception-v3 model from the previous experiment. Through these 
methods we replace the model’s 1,000 possible categories to produce only two classification
scores: ship or no ship. The evaluation process and performance metrics remain the same
as those presented previously. 

4.2.1 Ship Related Image Sets 

As described for the AK-47 experiment, we start with an Inception-v3 model pretrained 
on the ILSVRC 2012 training set. There are several relevant training categories from this 
image set for ship classification. These categories suggest the feature extractors from the
original Inception model, which will remain in our new ship or no ship classifier, will be
successful; see Table 4.4 for a selection. Most of the images from Camp’s [16] thesis are
close-ups of ships or contain water scenes with coastline, mountains, ice, or wildlife. The
high performance of our first retrained model inspired us to acquire some more challenging
examples for testing. Thus, we built the Ship Internet Images set by collecting images of 67
ships on the horizon, with more unique shapes, in front of sunsets, in fog, and other more 
difficult settings. We also collected 86 negative images of close-ups of objects like trash, 
cargo containers, and debris floating in the water. Prioritizing a more rigorous test of the 
model, we divided this set into training and test examples by only transferring seven ship 
examples and 26 negative examples into the training set. After observing a high FPR on the 
Ship Internet Images set by an algorithm retrained with only Camp’s [16] training set, we 
sought to show that the Inception-v3 model was capable of classifying these more difficult 
images by developing a small training set relating to the Ship Internet Images negative 
examples. With the easy-to-find example images already in the test set, we built a training 
set without requiring all negative examples to be water scenes. Specifically, we located 
images of trash in landfills and other settings in hopes of teaching the algorithm that trash 
in the water was not a ship without requiring more examples of trash floating in the water. 
With this mindset, we found 226 additional negative training examples for the Ship Internet 
Images training set by including images of trash and shipping containers not in the water, 
as well as buoys, oil rigs, sunsets over water, and ocean wildlife. 

4.2.2 Image Distortions 

Using flags available in the retraining script, we applied several different distortions to 
the images during training. Table 4.5 presents the available distortions. We vary the 
combinations of these operations during the training phase to permit an evaluation of
their effect on model performance. While these distortions can lead to increased classifier
performance by artificially increasing the size of the training set, engaging any one of them
significantly slows the retraining process for reasons we will discuss in the next section. 
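
To give a feel for what such distortions do to an image, the sketch below implements simplified versions of three of them with NumPy. It is loosely based on the descriptions in Table 4.5 and is not the retraining script's code: the function names are our own, and the brightness range of one plus or minus the percentage is our reading of the table entry.

```python
import numpy as np


def random_crop(image, rcp, rng):
    """Keep a randomly placed window whose sides are (100 - rcp)% of the original.

    Cropping both sides by the same fraction preserves the aspect ratio.
    """
    h, w = image.shape[:2]
    keep = 1.0 - rcp / 100.0
    new_h, new_w = int(round(h * keep)), int(round(w * keep))
    top = rng.integers(0, h - new_h + 1)
    left = rng.integers(0, w - new_w + 1)
    return image[top:top + new_h, left:left + new_w]


def random_brightness(image, rbp, rng):
    """Multiply pixel values by a factor drawn uniformly from 1 +/- rbp percent."""
    factor = rng.uniform(1.0 - rbp / 100.0, 1.0 + rbp / 100.0)
    return image * factor


def random_flip(image, rng):
    """Flip the image left to right with probability one half."""
    return image[:, ::-1] if rng.random() < 0.5 else image


rng = np.random.default_rng(0)
image = np.ones((100, 200, 3))            # stand-in for a training image

cropped = random_crop(image, 10, rng)     # 90 x 180 window
brightened = random_brightness(image, 10, rng)
flipped = random_flip(image, rng)
```

Each call produces a new variant of the same image, which is how distortions enlarge the effective training set without new labeled examples.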

Table 4.4. Training and test sets for the ship detection experiment.

Training Sets          Category              Number of Images
ImageNet [1]           container ship,...    1,613
                       speedboat             1,153
                       pirate, pirate ship   634
                       seashore, coast,...   2,382
                       other categories      1,275,385
Camp [16]              ship                  110
                       no ship               97
Ship Internet Images   ship                  7
                       no ship               226
Test Sets
Camp [16]              ship                  405 (8x)
                       no ship               99* (8x)
Ship Internet Images   ship                  60
                       no ship               60


This table is not all-inclusive of useful ImageNet categories as we omitted some 
that contain smaller water craft. *Note: this number is one less than the reported 
100 images from Chapter 2. We found one repeated image in both the training 
and test set from Camp's thesis and thus removed it from the test set. Camp's test 
set had all images scaled at 5, 10, 15, 20, 25, 50, 75, and full scales to simulate 
classifier performance on ships further from a sensor or on the horizon. 


4.2.3 Bottleneck Calculations 

As mentioned previously, retraining only the last fully-connected layer of a deep CNN can 
significantly reduce the time required to apply an existing algorithm to a new task. The 
program provided for retraining the Inception-v3 model achieves a large portion of this 
efficiency by creating bottleneck files. These files are essentially the model’s output of the
layer immediately before the last fully-connected layer, which is the only layer we update during
training. Running each image through about 98% of the network only once, and storing the
output as bottleneck files for later training steps, we avoid the additional operations required 
to feed an image through the entire network hundreds of times [50]. A downside of the 
operations discussed in Section 4.2.2 is that each distorted image is a new image that we 
must run through the entire network, eliminating this efficiency and significantly increasing 
training time. Table 4.6 illustrates this slow-down when retraining the Inception-v3 model 
under resource constraints, such as on a laptop in a deployed environment. Still, even with 

Table 4.5. Distortions available in the retraining script from TensorFlow.

Distortion Name     Effect
Random Crop         Accepts a percentage (RCP). Randomly picks a center for the
                    first crop in the image and maintains aspect ratio while keeping
                    a portion of the image that has a width of the original image
                    minus the RCP. RCP=10 would produce an image whose width
                    was 90% of the original width.
Random Scaling      Accepts a percentage (RSP). First conducts a central crop to
                    create an image whose scaled width (sw) is chosen by a random
                    variable (r) in a range from original width minus RSP to the
                    original width (w) as follows: sw = w - (RSP/100 * w) * r. This
                    distortion then employs bilinear interpolation to resize the cropped
                    image by a random percentage between zero and the RSP.
Random Brightness   Accepts a percentage (RBP). Using a random uniform
                    distribution, selects a number in the range 1 ± RBP/100 to
                    multiply with the original image pixel values.
Random Flip         Accepts a Boolean. If the Boolean equals true, randomly flips
                    50% of images horizontally left to right during training.


These distortions help to ensure the model is invariant to changes in the depiction 
of an object in an image. Rather than requiring unique examples of an object at 
different scales, horizontal orientations, brightness, and occlusions, these functions 
instead modify existing images. This effectively increases the size of the training 
set [50]. These distortions are specific to the Inception-v3 retraining script [51]
available through TensorFlow. 


distortions, transfer learning provides efficiencies from not requiring gradients to pass all
the way back through a network during backpropagation and weight updates. The training
times presented show it is possible for DOD members to repurpose pretrained models to
unit-specific tasks in environments with limited resources.
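
The saving from bottleneck files comes from a simple caching pattern, sketched below. This is a conceptual illustration and not the retraining script's implementation: `fake_extractor` is a hypothetical stand-in for the frozen layers, and the cache lives in memory here whereas the script stores its bottlenecks as files.

```python
import numpy as np


class BottleneckCache:
    """Run each image through the frozen layers once and reuse the stored output."""

    def __init__(self, feature_extractor):
        self._extract = feature_extractor
        self._cache = {}
        self.forward_passes = 0   # counts how often the expensive layers actually run

    def get(self, image_id, image):
        if image_id not in self._cache:
            self._cache[image_id] = self._extract(image)
            self.forward_passes += 1
        return self._cache[image_id]


def fake_extractor(image):
    """Hypothetical stand-in for the frozen layers: returns per-channel means."""
    return image.mean(axis=(0, 1))


# Five toy images visited on each of 100 training steps.
images = {"img_%d" % i: np.full((4, 4, 3), float(i)) for i in range(5)}
cache = BottleneckCache(fake_extractor)
for step in range(100):
    for image_id, image in images.items():
        bottleneck = cache.get(image_id, image)
```

After the loop, `cache.forward_passes` is 5 rather than 500. A distorted image has no stable identifier to cache under, so every distorted copy forces a fresh pass through the full network, which is the slowdown Table 4.6 records.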


4.2.4 Retraining Inception-v3 via Transfer Learning 

This section describes the process for retraining the Inception-v3 model to classify images 
as ship or no ship. We employ the 1080 Ti GPU from Table 4.2 for all of the experiments 
in this section. We follow a simple training tutorial [50] and script [51] for guiding this 
process. We first retrain the model with only Camp’s training set, no distortions, and the 
following hyperparameters: 

Table 4.6. Retraining times for Inception-v3 in a resource-constrained environment.

Distortions              Training Steps   Training Time
None, default            4,000            5m
None, default            8,000            11m
Random Scale (10%)       4,000            1d 22h 54m
All (10%) and Flip (T)   4,000            1d 23h 17m
All (10%) and Flip (T)   8,000            3d 22h 30m


This table depicts the slowdown in network retraining introduced by distortions. We
conducted these tests on a personal laptop running 64-bit Linux Ubuntu 16.04,
with the i5 CPU from Table 4.2. We ran TensorFlow version 0.8.0 in a virtualized
Docker container having access to all system resources to produce these numbers,
following the steps described in [50]. This method has a one-time cost of
approximately 25 minutes to compile the initial model to maximize training and
evaluation speed on our specific CPU. We complete all other iterations with the
same model with default settings and change the distortion flags from Table 4.5
to the values in parentheses. The training set included about 2,000 images divided
into two categories and the batch size per step was 100 images.


• batch size = 100

• training steps = 1,000 

• learning rate = 0.01 

• validation percentage = 10% 


We leave the batch size, the number of images fed through the model per training step before 
updating the model’s learned weights, and learning rate as the default recommended values 
provided with the script. We limit the training steps to 1,000 due a larger batch size and 
smaller number of training examples. In this case the model trains on each image 535 times, 
or completes 535 epochs of training. The validation percentage flag provides how many 
images to set aside from the training set to evaluate performance of the model throughout 
training. The script selects the validation images based off of a hash of their filenames; 
they remain consistent across runs. Table 4.7 provides the training and validation image 
counts for our runs. The size of our validation set is a source for concern; however, when 
working with a limited training set it remains necessary to keep as many images as possible 
for training. For each of the runs, TensorFlow logs important training information that users 
can access via TensorBoard [52]. 
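The two mechanisms in this paragraph, the epoch count and the filename-hash validation split, can be sketched as follows. This is a minimal illustration and not the retraining script's own code: the function names, the SHA-1 choice, and the 100-bucket mapping are assumptions made for the example. The 187 training images come from the ship_images_orig row of Table 4.7 (99 + 88).

```python
import hashlib


def in_validation_set(filename: str, validation_percentage: float) -> bool:
    """Decide membership from a hash of the filename only (assumed scheme).

    Because the decision depends only on the name, the same file lands in
    the same split on every run, regardless of what other files exist.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < validation_percentage


def epochs_completed(steps: int, batch_size: int, num_images: int) -> float:
    """Average number of times each training image is fed to the model."""
    return steps * batch_size / num_images


# 1,000 steps of 100 images over the 187 training images of ship_images_orig.
print(round(epochs_completed(1000, 100, 187)))
```

Hashing the name rather than drawing random numbers is what keeps the split consistent across runs; the cost is that the realized validation fraction only approximates the requested percentage on small sets.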
Table 4.7. Summary of ship detector training iterations.

Iteration Name                 Train (+/-)   Valid. (+/-)   Distortions (Value)
ship_images_orig               99 / 88       11 / 9         None
ship_images_added_default      106 / 323     11 / 32        None
ship_images_added_bright10     106 / 323     11 / 32        Random Bright. (10%)
ship_images_added_crop10       106 / 323     11 / 32        Random Crop (10%)
ship_images_added_scale10      106 / 323     11 / 32        Random Scale (10%)
ship_images_added_flip         106 / 323     11 / 32        Random Flip (T)
ship_images_added_bsc10flip    106 / 323     11 / 32        All Distortions (10%), Random Flip (T)


The train and validation numbers represent the number of ship (+) and no ship (-)
examples available during training. For the last model, we set random brightness,
scale, and crop to ten percent and also the random flip flag to true in order to
observe the effect on performance of combining the distortions.


Figure 4.1 displays the performance of our first model using only Camp’s image set; it 
classified the validation set perfectly. Due to high training and validation accuracy, we did 
not train additional models with only these images. Instead, we added our Ship Internet 
Images training set to Camp’s and repeated the training process. We apply the same defaults 
as above, but vary the distortions as depicted in Table 4.7. 

From Figure 4.2, a screen capture of TensorBoard, we observe the validation accuracy 
to select the best model for another round of training. Examining the plot, we see the 
distortions do not improve the performance. Thinking about the Inception-v3 model’s 
initial training, one can derive a logical reason for the distortions’ apparent ineffectiveness. 
Google originally trained the lower layers reading the raw pixel inputs on distorted images. 
Thus, the low-level feature extractors that distortions affect most are already invariant to 
them. Realizing this, we did not employ them for any further retraining iterations. 

Figure 4.2 presents classification accuracy on the validation set plotted against the number 
of training steps completed. The lighter lines depict raw accuracy measures, recorded every 
10 training steps, whereas the darker lines represent a smoothed value from these measures. 
The model without any distortions slightly outperforms the others at its final iteration.
Selecting this as the best model to train for another round, we decreased the learning rate 
to 0.001 and increased the training steps to 4,000. Figure 4.3 presents a comparison of 
[Figure 4.1 plot, captured from TensorBoard. Legend: training and validation accuracy both read 1.000 (raw and smoothed) at step 999. Horizontal axis: training step, 0 to 1,000.]


Figure 4.1. Ship detector training and validation accuracy vs. training step
with only Camp’s [16] images from TensorBoard. This plot and legend
are screen captures from the web-browser-based TensorBoard application
that is a part of TensorFlow. The interactive plot allows a user to slide
the mouse along the training steps to observe values stored at each step.
The legend displays the final values for each model. In keeping with our
goal of presenting the functionality of TensorFlow, we use these captures
rather than exporting the data as comma-separated values and generating
our own plots; similar figures throughout this thesis are also from TensorBoard.


the validation accuracy for our top-performing initial model (ship_images_added_defaults)
and its second-round configurations (ship_images_added_defaults_rnd_2).

As is evident in Figure 4.3, validation accuracy remains roughly constant after 1,000 training 
steps. This observation, combined with our desire to avoid overfitting due to the fact that 
we conduct training with a relatively small data set, encourages a selection of the original 
model completing only 1,000 training steps. We look elsewhere for confirmation of this 
intuition. The retraining script also logs the softmax layer’s cross-entropy loss, a measure 
of the difference between the model’s predicted category scores for an image compared to 
the labeled true category, which training strives to minimize. Observing Figure 4.4, the 
shallow descent and higher final cross-entropy loss of the second round model suggest we 
[Figure 4.2 plot, captured from TensorBoard. The surviving legend entries give final validation accuracies at step 999: 0.9400 for an entry whose name was lost, 0.9200 for ship_images_added_crop10, and 0.9100 for ship_images_added_bright10; the remaining entries did not survive extraction.]

Figure 4.2. Ship detector validation accuracy vs. training step for first round
of training. This figure displays the ineffectiveness of the distortions (described
in Table 4.7) in improving accuracy on the validation set. The
top-performing model (ship_images_added_defaults) did not employ any
distortions and we carry it forward for further comparisons in Figure 4.4.


set the learning rate too low. The combination of these two pieces of evidence provides
confidence in selecting the default model (ship_images_added_defaults) trained for 1,000
steps to evaluate against the test sets.
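The cross-entropy loss referred to above can be illustrated with a small sketch. It assumes the standard definition (negative log of the softmax probability assigned to the labeled category); the function names are illustrative and this is not the retraining script's implementation.

```python
import math


def softmax(logits):
    """Convert raw category scores into probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_index):
    """Loss for one image: -log(probability given to the labeled category)."""
    return -math.log(softmax(logits)[true_index])


# A confident correct prediction yields a small loss; a confident wrong
# prediction yields a large one.
print(cross_entropy([5.0, 0.0], true_index=0))
print(cross_entropy([0.0, 5.0], true_index=0))
```

Under this definition, a final validation loss that is higher for the second-round model than for the first means the second-round model assigns, on average, less probability to the labeled category.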


4.2.5 Evaluating Retrained Inception-v3 on Ship Classification 

We evaluate the retrained model with a nearly identical process to the one described in 
Section 4.1.3. We maintain the separation of Camp’s [16] test set with our Ship Internet 
Images test set and redo this evaluation process to store the scores for ship and no ship 
returned for each image. We do not apply distortions during the evaluation of the test 
images besides rescaling and cropping to reduce the image to the required Inception-v3 
model input of 299 by 299 pixels. 
Figure 4.3. Ship detector validation accuracy vs. training step for second
round of training. The ship_images_added_defaults model is
the top-performing model from Figure 4.2. The second-round model
(ship_images_added_defaults_rnd_2) is the same model from the first
round, but we train the second model with a smaller learning rate (0.001
instead of 0.01) and for an extra 3,000 training steps.

4.2.6 Ship Experiment Performance Metrics 

Camp does not directly describe his calculations for recall and precision; we assume he 
employs the standard equations presented in Section 4.1.4. He does, however, present his
results as ROC curves. We repeat our preceding method to develop our ROC curves to
allow a direct comparison of performance to Camp’s curves in Chapter 5. 
4.3 Screenshot Detection in Images

Following suit, this section will discuss the image set, training and evaluation processes, 
and metrics for comparing performance between modern algorithms built in TensorFlow 
and those employed by Sharpe [17] in order to detect screenshots in images. 
Figure 4.4. Training and validation cross-entropy vs. training step for both
rounds of training.

4.3.1 Screenshot Image Sets 

Training a deep CNN from scratch demands significantly more labeled examples than transfer
learning. As presented in Chapter 2, Sharpe [17] gathered over 6,000 images for her
experiment. While substantial, her image set pales in comparison to those as large as the ImageNet
dataset. Also, because her algorithms employ engineered feature extractors instead
of learning them from scratch, she only required ten percent of these images for training.
Needing to test our models against at least 4,800 of her images for a true comparison, we set
out to find other examples for training. First, we select ten ImageNet categories representative
of objects or scenes in Sharpe’s [17] negative image set. Next, converting 192 videos
from Wikimedia Commons’ Screencast videos [53] category, we capture one frame as a
JPEG for every 30 frames of each video. This method produced 11,428 screenshot examples.
Finally, we wrote a script to automatically capture 111 screenshots while performing
various tasks on a personal computer. Within the ImageNet negative examples, there existed
approximately 300 digitally augmented images that resembled screenshots. After removing
them, we narrow down our training set to 11,605 screenshot and 12,789 other images. We
split these images by reserving about ten percent of them for a validation set to arrive at the


[Figure 4.4 plot, captured from TensorBoard. Legend, final cross-entropy values: ship_images_added_defaults reads 0.02607 (training) and 0.07944 (validation) at step 999; ship_images_added_defaults_rnd_2 reads 0.04856 (training) and 0.1474 (validation) at step 3,999. Horizontal axis: training step, 0 to 4,000.]
numbers depicted in Table 4.8. Since we did not want to contaminate the validation set, we
moved groups of screenshot examples derived from the same video into the validation set.
This ensured there were no images from the same video in both the training and validation
sets. By building this Screenshots dataset we reserve all of Sharpe’s [17] images for testing.
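One way to keep all frames from a video on the same side of the split is to assign whole videos, rather than individual frames, to the validation set. The sketch below is an assumption about how such a grouping could be done; the thesis does not describe its selection procedure, and the function name, the seed, and the (filename, video id) input format are hypothetical.

```python
import random
from collections import defaultdict


def split_by_group(items, validation_fraction, seed=0):
    """Split (filename, group_id) pairs so that no group spans both sets."""
    groups = defaultdict(list)
    for filename, group_id in items:
        groups[group_id].append(filename)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)  # reproducible group order

    target = validation_fraction * len(items)
    train, validation = [], []
    for group_id in group_ids:
        # Fill the validation set with whole groups until the target is met.
        if len(validation) < target:
            validation.extend(groups[group_id])
        else:
            train.extend(groups[group_id])
    return train, validation


# Ten hypothetical videos with five frames each, ten percent held out.
frames = [("video%d_frame%d.jpg" % (v, k), v) for v in range(10) for k in range(5)]
train_files, validation_files = split_by_group(frames, 0.10)
print(len(train_files), len(validation_files))
```

Because groups are moved whole, the validation set can overshoot the requested fraction by up to one group; with many short videos the overshoot is small.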


Table 4.8. Training, validation, and test sets for the screenshot experiment.

Set              Source                       Category                             Number of Images
Training Sets    ImageNet [1]                 house                                1,231
                                              range, mountain range,...            2,029
                                              government building                  334
                                              building, edifice                    1,421
                                              coast                                321
                                              littoral, litoral,...                1,932
                                              furniture, piece of furniture,...    2,138
                                              tree                                 1,181
                                              divided highway, dual carriageway    1,213
                                              street                               1,330
                 Screenshots                  screenshot                           10,414
                                              other                                11,513
Validation Set   Screenshots                  screenshot                           1,191
                                              other                                1,276
Test Set         Screenshots (Sharpe [17])    screenshot                           2,420
                                              other                                3,644


Before beginning training, we check that we did not make the critical mistake of contaminating
the test set. To do this we must compare all of the images in the training and
validation sets to those in the test sets to be certain no duplicates exist. By employing a
hash algorithm on the images’ content we can quickly and reliably compare a large number 
of images. We first enter all of the training and validation images into a hash database by 
running them through a difference hash programmed in Python [54]. The difference hash 
tracks gradients in images to match duplicates [55]. We then check all of Sharpe’s [17] test 
images against the hash database. In our case, we found two duplicates from our ImageNet 
negative examples. We remove these images from our training set to leave the test set 
intact. During the image hashing, the difference hash also identified that approximately
5,500 of the 11,605 screenshot examples are near duplicates. While unsurprising due to
our screenshot generation methods, this is worth noting as a weakness in our dataset. We
left the near duplicates in place because slight variations in the screenshots still provide useful training
examples; at worst they cause overfitting of the model, which we can identify during training
if the model performs poorly on our validation set. As a final note, we conducted this image
hashing process for each of our experiments but save mentioning it until now as it was most
important for this large training set gathered from similar sources to the test set.
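The idea behind a difference hash can be sketched in a few lines. This is not the implementation cited in [54]; it is a simplified stand-in that assumes the image is already available as a two-dimensional grid of grayscale values and uses a crude nearest-neighbour downscale, where a real implementation would typically use an image library with a smoother resize. All function names are illustrative.

```python
def downscale(gray, width=9, height=8):
    """Nearest-neighbour resize of a 2-D grayscale grid (a list of rows)."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def difference_hash(gray):
    """64-bit hash: one bit per horizontally adjacent pixel pair.

    Each bit records whether the left pixel is brighter than the right one,
    so the hash tracks gradients rather than absolute brightness.
    """
    bits = 0
    for row in downscale(gray):
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distances suggest near duplicates."""
    return bin(hash_a ^ hash_b).count("1")


# A synthetic gradient image, a uniformly brightened copy, and a mirrored copy.
image = [[3 * x + y for x in range(64)] for y in range(48)]
brighter = [[p + 10 for p in row] for row in image]
mirrored = [list(reversed(row)) for row in image]
print(hamming_distance(difference_hash(image), difference_hash(brighter)))
print(hamming_distance(difference_hash(image), difference_hash(mirrored)))
```

Exact duplicates share a hash, and a distance threshold of a few bits is a common way to flag near duplicates; the threshold used in the thesis is not stated, so none is assumed here.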

4.3.2 Training Modern Models to Identify Screenshots 

The training process involves two distinct steps: first, import the Screenshots training, 
validation, and test sets into a TensorFlow Dataset [56] and second, run the TensorFlow- 
Slim training functions to initialize and train the three models presented in Chapter 2 from 
scratch. 

After building the Screenshots sets described above, we convert the images into a single 
TensorFlow Dataset. We read all of the images from our sets into a TensorFlow file format 
known as a TensorFlow Record (TFRecord) [57]. This format stores the information from 
thousands of JPEG files into a handful of TFRecord files that facilitate fast streaming of 
images into memory without opening thousands of separate files from the disk. While 
importing Sharpe’s [17] grayscale negative example images into this format, we repeat the
single-channel pixel values three times, once for each of the red, green, and blue channels, to allow
our model to receive the expected three-channel input size when evaluating these images.
We then associate the TFRecord files for each set into a TensorFlow-Slim defined Python 
object known as a Dataset. This allows us to point TensorFlow to a desired image set by 
passing the Dataset name, in our case Screenshots, and a split name: training, validation, 
or test. Using this object, TensorFlow supervises which images get passed to the model 
for training to ensure the network sees all training examples before repeating the images 
for subsequent epochs. Our use of TensorFlow’s Dataset class in our experiments provides 
access to desired image sets in a manner that prevents contamination among our three splits 
and facilitates the use of our new dataset with existing TensorFlow training and evaluation 
functions. Having efficient access to the image content and associated category labels
necessary for our experiments, we stand ready to train the models.
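The grayscale handling described above amounts to copying each single-channel value into all three color channels. A minimal sketch on plain nested lists follows; the function name is illustrative, and the thesis performs this step while writing TFRecord files, which this example does not attempt to reproduce.

```python
def gray_to_rgb(gray):
    """Repeat each grayscale value across the red, green, and blue channels."""
    return [[(value, value, value) for value in row] for row in gray]


# A 2 x 2 grayscale image becomes a 2 x 2 image of (R, G, B) triples.
print(gray_to_rgb([[0, 128], [200, 255]]))
```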
Using a TensorFlow-Slim script [58], we supply the Screenshot Dataset training split and
the following inputs to train the three models covered in Chapter 2: 

• batch size = 25 

• max number of steps = 50,000 

• optimizer = adam 

• model name = vgg_16, resnet_v2_50, or inception_resnet_v2 

With this batch size, number of training steps, and the number of training images available, 
we train each model for approximately 57 epochs. Table 4.9 depicts the training time for all 
training iterations of our models. We run the Adam [59] optimizer, with TensorFlow’s default
configurations, to adjust the learning rate throughout training as depicted in Figure 4.5.
We train all three models on a Windows 10 desktop employing TensorFlow r1.0.1, Python
3.5.3, and both the 1080 Ti GPU and i7 CPU from Table 4.2. To start the actual training,
the CPU loads images from the Screenshot Dataset object and then conducts the prescribed 
preprocessing steps for each model [12], [13], [30], [31]. Provided a properly formatted 
Dataset object, TensorFlow-Slim contains scripts to quickly execute these preprocessing 
steps and save the correctly-sized images for input to a specific network in a queue in the 
CPU’s memory. The GPU then requests batches of these images for each training step. TensorFlow
manages this queue, which enables more efficient training of a network on the GPU
by reserving its processing power for the forward pass and back propagation of gradients to 
update the network’s weights. The queue also minimizes the time the GPU has to wait on 
the preprocessing of images. As a separate process on the CPU, we run an evaluation step to 
capture a model’s performance on 500 images from the validation set every 1,000 seconds 
and then on all validation images at the completion of training. We repeat these steps for 
each model listed and use TensorBoard to produce the plots in Figure 4.6 and Figure 4.7. 
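The division of labor described in this paragraph, with the CPU preprocessing images into a queue and the GPU drawing batches from it, follows a producer-consumer pattern. The sketch below is a plain-Python analogy using the standard library, not TensorFlow's queue implementation; the preprocessing stand-in, the queue capacity, and all function names are assumptions for illustration.

```python
import queue
import threading


def preprocess(image_id):
    """Stand-in for the model-specific preprocessing done on the CPU."""
    return image_id * 2


def producer(image_ids, out_queue):
    """CPU side: preprocess images and place them in a bounded queue."""
    for image_id in image_ids:
        out_queue.put(preprocess(image_id))  # blocks while the queue is full
    out_queue.put(None)  # sentinel marking the end of the data


def consume_batches(in_queue, batch_size):
    """Training side: pull ready images off the queue in batches."""
    batches, batch = [], []
    while True:
        item = in_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:
        batches.append(batch)  # final partial batch
    return batches


def run(image_ids, batch_size=25, capacity=100):
    """Run the producer in a background thread while consuming batches."""
    image_queue = queue.Queue(maxsize=capacity)
    worker = threading.Thread(target=producer, args=(image_ids, image_queue))
    worker.start()
    batches = consume_batches(image_queue, batch_size)
    worker.join()
    return batches


print([len(b) for b in run(range(60))])
```

The bounded queue is the key design point: preprocessing runs ahead of the consumer only until the queue fills, so memory use stays limited while the consumer rarely has to wait.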

As Figure 4.6 and Figure 4.7 show, the ResNet model achieves the best performance in 
both training and evaluation on the validation set at the final training iteration. Looking 
closer at Figure 4.7, at around 35,000 steps both Inception-ResNet-v2 and VGGNet start to 
perform worse during evaluation. This drop, after a period of about 15,000 training steps 
with consistent performance in terms of validation accuracy, could mean that the models 
are overfitting to the training set. Still, both models seem to recover and achieve comparable 
performance to the ResNet model on the entire validation set after 50,000 steps. Since we 
Table 4.9. Training times for two iterations of three screenshot models.

Round               Model                  Training Steps   Training Time
Round 1: Defaults   VGGNet                 50,000           8h
                    ResNet                 50,000           4h
                    Inception-ResNet-v2    50,000           12h
Round 2             VGGNet                 30,000           5h
                    ResNet                 30,000           3h
                    Inception-ResNet-v2    30,000           9h


All times are for training with the 1080 Ti GPU and i7 CPU from Table 4.2. Times 
are rounded up to the nearest hour. 


[Figure 4.5 plot, captured from TensorBoard. Vertical axis: learning rate, with ticks from 2.000e-3 to 0.0100.]

Figure 4.5. Learning rate vs. training steps for all three models. TensorFlow-Slim’s
training methods adjust the learning rate throughout the training process
by employing TensorFlow’s implementation of the Adam [59] optimizer.
This plot was identical for all three models.
do not get significant separation of performance on the validation set we repeat the training
with two adjustments. First, rather than classifying only 500 of the validation images every 
1,000 seconds we evaluate the entire set every 2,000 seconds. We increase the interval 
since this operation slows the training process by consuming most of the CPU’s resources 
for evaluation. Second, we decrease the number of training steps to 30,000 to test our 
hypothesis of overfitting. Figure 4.8 and Figure 4.9 depict the cross-entropy and validation 
set accuracy for each model during this second round of training. 
Figure 4.6. Screenshot detector cross-entropy vs. training step for round
one.


In round two, the Inception-ResNet-v2 achieves the best performance and scores the highest 
validation set accuracy out of all models from both rounds of training. However, the ResNet 
model from round one still scored the lowest cross-entropy loss out of all of the models. Due 
to these inconsistencies, and some concerns about our newly built validation set accurately 
representing Sharpe’s [17] test set, we carry both iterations of the VGGNet, ResNet, and 
Inception-ResNet-v2 models forward to compare performance on the Screenshots test set. 
Best practice suggests training hundreds of iterations of these models, adjusting training parameters
for each, to achieve maximum performance. We forgo this hyperparameter search
Figure 4.7. Screenshot detector validation set accuracy vs. training step for
round one.


due to strong performance on the validation set, time and compute resource constraints, and 
TensorFlow providing the recommended configurations from each algorithm’s creators. 


4.3.3 Evaluating Deep CNNs as Screenshot Detectors 

The evaluation method for this experiment varies due to our implementation of a 
TensorFlow-Slim Dataset. All software and hardware remain the same as in Section 4.3.2.
We run our evaluation on all of the available 6,064 archived images from Sharpe’s [17] 
experiments. In contrast, she only evaluates performance through ten-fold cross validation 
on a selection of 4,800 of these images. With no record of which 4,800 images Sharpe [17] 
included, we evaluate our best models on all available examples. Similar to training, we 
engage the CPU to complete model-specific preprocessing steps to generate a queue of 
images. 

For the VGGNet and ResNet models, with pixel values starting as integers between zero 
and 255, the evaluation preprocessing steps are as follows: 
Figure 4.8. Screenshot detector cross-entropy vs. training step for round
two.

1. Resize the image so that the smallest side equals 224 pixels. 

2. Conduct a central crop to match the longer side to 224 pixels while maintaining the
original aspect ratio.

3. Subtract the training set mean red, blue, and green pixel values from each pixel in the
image.

For the Inception-ResNet-v2 model, with pixel values starting as 32-bit floats in the range
between zero and one, here are the steps:

1. Conduct a square-central crop containing 87.5% of the original image area.

2. Execute a bilinear resize on the image to match a 299 by 299 input size.

3. Subtract 0.5 from all pixel values.

4. Multiply all pixel values by 2.0 so that values range between negative one and one.
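The arithmetic in these two lists can be sketched without an image library. The example covers the VGGNet and ResNet steps (resize target, central crop box, mean subtraction) and the last two Inception-ResNet-v2 steps (shift and scale); it leaves out the 87.5% crop and the bilinear resize. The function names are illustrative, and the channel means are passed in as a parameter because the values used in the thesis are not given in this section.

```python
def resize_dims_smallest_side(width, height, target=224):
    """New (width, height) after scaling so the smallest side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def central_crop_box(width, height, target=224):
    """(left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def subtract_mean(pixel, channel_means):
    """Subtract a per-channel mean from one pixel (VGGNet and ResNet step 3)."""
    return tuple(value - mean for value, mean in zip(pixel, channel_means))


def inception_scale(value):
    """Map a float pixel from [0, 1] to [-1, 1] (Inception steps 3 and 4)."""
    return (value - 0.5) * 2.0


# A hypothetical 640 x 480 input image.
new_width, new_height = resize_dims_smallest_side(640, 480)
print(new_width, new_height)
print(central_crop_box(new_width, new_height))
print(inception_scale(0.0), inception_scale(0.5), inception_scale(1.0))
```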

After preprocessing, TensorFlow sends batches of 100 images to the GPU and runs them
through the current model to produce a score for screenshot and other. Once the 61 batches
finish, we store each image filename and associated scores in a comma-separated value file
Figure 4.9. Screenshot detector validation set accuracy vs. training step for
round two.

for evaluating model performance. Table 4.10 presents the time to evaluate all three models
on the test set images.


Table 4.10. Screenshot models’ image classification times.

Processor   Model                  Total Time (s)   Time per Image (s)   Images / sec
GPU         VGGNet                 49.4             0.00814              122
            ResNet                 38.1             0.00628              159
            Inception-ResNet-v2    86.3             0.0142               70


This table displays a measure of the time, in seconds, required to classify images
with deep CNNs in TensorFlow when using the TFRecord format and evaluating
batches of images simultaneously. We produce these times with the
evaluation process described in Section 4.3.3 by employing the 1080 Ti GPU and i7 CPU
from Table 4.2 and evaluating 6,064 test images. The 1080 Ti has more resources
available than the TITAN Black GPU that produced the times in Table 4.3. Therefore,
we cannot compare the Inception-ResNet-v2 model in both tables to attempt
to measure specific time savings with the TFRecord format.
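The derived columns of Table 4.10 and the batch count quoted in Section 4.3.3 follow from the total times and the 6,064 test images. The short check below recomputes them; the names are illustrative, and the comparison suggests the table's derived values were truncated rather than rounded.

```python
import math

NUM_TEST_IMAGES = 6064
BATCH_SIZE = 100
TOTAL_SECONDS = {"VGGNet": 49.4, "ResNet": 38.1, "Inception-ResNet-v2": 86.3}


def num_batches(num_images, batch_size):
    """Batches needed to cover every image, counting a final partial batch."""
    return math.ceil(num_images / batch_size)


def seconds_per_image(total_seconds, num_images):
    return total_seconds / num_images


def images_per_second(total_seconds, num_images):
    return num_images / total_seconds


print(num_batches(NUM_TEST_IMAGES, BATCH_SIZE))
for model, total in TOTAL_SECONDS.items():
    print(model,
          seconds_per_image(total, NUM_TEST_IMAGES),
          images_per_second(total, NUM_TEST_IMAGES))
```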
4.3.4 Screenshot Performance Metrics

We calculate the same metrics defined in Section 4.1.4 to grade the top model’s performance. 
Sharpe [17] does not provide the equations she implemented for her performance metrics 
so we assume she follows the standard definitions. Using the process described previously,
we summarize performance in a ROC curve and present the results in Chapter 5. 
CHAPTER 5:
Results and Discussion


With our experiments complete, this chapter presents and discusses results with the goal 
of answering the first four research questions from Section 1.6. Following the order of 
introduction in Chapter 4, we provide ROC curves and F-scores for the three detectors and 
compare their classification performance against previous methods. For each experiment, 
we will examine where the best algorithm fails in an attempt to better understand which 
features the algorithm learned or missed during training. In Section 5.4 we analyze the 
training and evaluation times presented in Chapter 4 in order to assess the feasibility of 
TensorFlow-based image classification in deployed and resource-constrained environments. 
In summary, we present our empirical measure of TensorFlow in this chapter. 

5.1 Performance of the Inception Models as AK-47 Detectors

This section examines the pretrained Inception models’ performance as AK-47 detectors. 
To start, we compare performance across two category scores and their sum: AR; rifle; 
and AR + rifle. Next, we observe the five Inception models’ performance on Jones’ [15] 
images. Third, we present performance of the best Inception model, selected by highest 
F-score on Jones’ [15] test set, against the two additional image sets to more vigorously test 
its classification performance. Finally, we examine the misclassifications on all three test 
sets to provide further insight into the behavior of the Inception models as AK-47 detectors. 

5.1.1 Three Score Comparison 

Recording our three separate category-classification scores proves unnecessary. Figure 5.1 
displays the ROC curve for the first Inception model on all three test sets. For Inception-v1,
the rifle category performed its best relative to the AR score as an AK-47 detector.
Still, it scored a lower F-score than the AR category individually and did not improve the
F-score when added to the AR category. Looking at the latest model, Figure 5.2 depicts
the same plot for Inception-ResNet-v2. Here the rifle category again performs worse and
does not improve F-score when added to the AR score. As a result of these findings and
a preference for simplicity, we consider only the AR category score for our remaining
comparisons. All of the Inception model ROC plots of this three score comparison, and
tables containing corresponding performance metrics at the best F-score threshold, are
available in Appendix A.1.1.



Figure 5.1. Inception-v1 ROC curve for three category scores on all three test
sets combined. All of our ROC curves in this section plot the recall versus
FPR at the 1,000 score thresholds as described in detail in Section 4.1.4.
The lines on these plots connect these values without performing any sort
of fitting. The circles plotted on the graph represent the threshold for the
highest F-score; we include the recall (TPR) and FPR values generating this
F-score in each legend.


Figure 5.2. Inception-ResNet-v2 ROC curve for three category scores on all
three test sets combined.

5.1.2 AK-47 Detector Performance on Jones’ Test Images 

The Inception models perform well as AK-47 detectors without any additional training 
and show significant performance increases from the first iteration to the latest Inception- 
ResNet-v2 model. Figure 5.3 provides the ROC curve for all five Inception models evaluated 
on Jones’ [15] test set images; in the Appendix, Figure A.6 shows a closer view of the same 
curves. At the best F-score threshold, from v1 to ResNet-v2, the recall rate increased
by eight percentage points while the FPR dropped by a factor of three. At its optimal 
threshold, the Inception-ResNet-v2 correctly identified 645 of 687 AK-47 images while 
only incorrectly identifying 10 of 7,045 negative images as containing an AK-47. 
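The counts quoted above can be turned into recall, FPR, and F-score with the standard definitions, which this sketch assumes match those of Section 4.1.4. It also shows one way to sweep score thresholds and keep the one with the highest F-score; the even spacing of thresholds between zero and one and all function names are assumptions for illustration.

```python
def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def rates(tp, fp, fn, tn):
    """Return (recall, FPR, precision, F-score) from confusion counts."""
    recall = _ratio(tp, tp + fn)
    fpr = _ratio(fp, fp + tn)
    precision = _ratio(tp, tp + fp)
    f_score = _ratio(2 * precision * recall, precision + recall)
    return recall, fpr, precision, f_score


def best_f_threshold(scores, labels, num_thresholds=1000):
    """Sweep evenly spaced thresholds and keep the first with the top F-score."""
    best_threshold, best_f = None, 0.0
    for i in range(num_thresholds):
        threshold = i / num_thresholds
        tp = fp = fn = tn = 0
        for score, is_positive in zip(scores, labels):
            predicted_positive = score > threshold
            if predicted_positive and is_positive:
                tp += 1
            elif predicted_positive:
                fp += 1
            elif is_positive:
                fn += 1
            else:
                tn += 1
        f_score = rates(tp, fp, fn, tn)[3]
        if f_score > best_f:
            best_threshold, best_f = threshold, f_score
    return best_threshold, best_f


# Counts from the text: 645 of 687 positives found, 10 of 7,045 negatives flagged.
recall, fpr, precision, f_score = rates(tp=645, fp=10, fn=687 - 645, tn=7045 - 10)
print(round(100 * recall, 2), round(100 * fpr, 2), round(f_score, 2))
```

The printed recall, FPR, and F-score agree with the values reported for Jones' video frames in the Figure 5.4 legend.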


[Figure 5.2 legend: ResNet-v2 AR, max F-score 0.94 (FPR: 0.29%, TPR: 91.56%); ResNet-v2 rifle, max F-score 0.92 (FPR: 0.60%, TPR: 90.55%); ResNet-v2 AR + rifle, max F-score 0.94 (FPR: 0.45%, TPR: 92.57%).]
For the reasons discussed in Section 4.1.5, we cannot directly compare the performance
of the Inception models to the algorithms implemented by Jones [15]. In order to avoid
misleading the reader we do not provide Jones’ [15] ROC curve in this section for a side-by-side
comparison. Still, we can observe the recall when the FPR equals zero to obtain a
sense of how strongly the Inception-ResNet-v2 model outperforms Jones’ [15] algorithms.
Instead of using the best F-score threshold for our classifier, we use the highest
AR category score that generated a false positive on a negative example. Classifying any
image with a score above this threshold as an AK-47, we obtain an FPR of zero and a
recall of 86.2%. At this new threshold, our algorithm still classifies 592 out of 687 AK-47
examples correctly without generating any false positives. Observing the left side of the
ROC curves from Figure 2.2, all of the algorithms begin to generate false positives before
producing recall rates above 78%. This simple comparison removes our largest difference
in calculations, the FPR, and inspires confidence that the Inception models improve
classification error rates in their out-of-the-box configuration. See Appendix A.1.2 for a
closer view of Figure 5.3 and also a table of key performance metric values at the best
F-score threshold for each model.
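The zero-FPR comparison above can be expressed directly: take the highest score given to any negative example as the threshold, then count the positives that score strictly above it. The sketch uses made-up scores for illustration, and the function names are assumptions; only the final line uses counts from the text.

```python
def zero_fpr_threshold(scores, labels):
    """Highest score assigned to any negative example."""
    return max(score for score, is_positive in zip(scores, labels) if not is_positive)


def recall_above(scores, labels, threshold):
    """Fraction of positive examples scoring strictly above the threshold."""
    positives = [score for score, is_positive in zip(scores, labels) if is_positive]
    return sum(1 for score in positives if score > threshold) / len(positives)


# Hypothetical scores: three positives followed by two negatives.
example_scores = [0.95, 0.70, 0.40, 0.60, 0.10]
example_labels = [True, True, True, False, False]
threshold = zero_fpr_threshold(example_scores, example_labels)
print(threshold, recall_above(example_scores, example_labels, threshold))

# The recall quoted in the text: 592 of 687 positives above the threshold.
print(round(100 * 592 / 687, 1))
```

Because no negative scores above its own maximum, the FPR at this threshold is zero by construction, which is what allows the recall values to be compared without relying on how the FPR was calculated.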

5.1.3 AK-47 Detector Performance on Internet Images 

Desiring to stretch the best performing Inception model further, we evaluate Inception-ResNet-v2
on the AK-47 Internet Images and Rifle-Like Images test sets described in Section 4.1.1.
Figure 5.4 portrays the ROC curves for the Inception-ResNet-v2 model on these
two new test sets with the curve produced previously for Jones’ [15] test set. Not surprisingly,
performance of the model drops when we test it against more naturally occurring
images of AK-47s versus the AK-47 video stills. On the AK-47 Internet Images test set,
at its best threshold the Inception-ResNet-v2 model classifies 98 out of 107 of the images
containing AK-47s correctly, and only produces two false positives out of 143 negative examples.
Even when attempting to fool the model by providing images of rifle-like objects,
the Inception-ResNet-v2 model performs reasonably well; it only classifies 18 of 100 of
the images incorrectly. As before, Appendix A.1.3 contains more detailed metrics for these
tests.
Figure 5.3. AK-47 detector performance using the AR category for all Inception
models on Jones’ [15] test set.


5.1.4 AK-47 Detector Missteps Analysis 

An analysis of the Inception-ResNet-v2 model’s missteps provides insight into the algorithm’s
performance and the kinds of mistakes it makes. Interesting trends exist in these
missteps in terms of both false negatives and false positives. Starting with false negatives, 
we first explore failures in Jones’ [15] test set due to the opportunity to compare stills from 
the same video that the algorithm classified correctly to those it missed. In about 30 of the 
false negatives, the algorithm missed when an object of a known category in the background 
became the central focus of an image. For example, a video sequence of a man shooting 
an AK-47 in front of a mobile home produced 24 of the false negatives. The algorithm 
missed when the camera zoomed out and included the entire mobile home in the center of 
Figure 5.4. Best AK-47 detector performance using the AR category on
internet test sets.

[Figure 5.4 legend: Jones Video Frames, max F-score 0.96 (FPR: 0.14%, TPR: 93.89%); Internet Images, max F-score 0.94 (FPR: 1.40%, TPR: 91.59%); Internet + Rifle-Like Images, max F-score 0.85 (FPR: 7.82%, TPR: 88.79%).]


the image behind the shooter. Table 5.1 presents the top-three classification scores for a
distant-view image, which Inception-ResNet-v2 misclassified at our best F-score threshold,
and the scores for the close-up view of the AK-47 which the model classified correctly. In
the distant view, the algorithm detected the mobile home so strongly that it did not produce 
a significant score in any other category. 
Table 5.1. AK-47 detector top-three category score comparison for false
negatives versus true positives in same video sequences.

Image          Cat. 1            Score   Cat. 2        Score   Cat. 3       Score
Mob. Home(-)   mobile home       0.946   revolver      0.002   boathouse    0.001
Mob. Home(+)   assault rifle     0.586   mobile home   0.159   rifle        0.158
Lawn(-)        giant schnauzer   0.121   Appenzeller   0.080   lawn mower   0.064
Lawn(+)        assault rifle     0.755   chainsaw      0.086   rifle        0.065
Boots(-)       cowboy boot       0.121   plow          0.080   bow          0.064
Boots(+)       assault rifle     0.070   valley        0.060   worm fence   0.046


This table compares a false negative (-) and true positive (+) frame from the
same videos of Jones' [15] test set to describe causes of error. We include only 
the first category description instead of the full name of each ILSVRC category. 


In a second instance of failure, pictured in Figure 5.5, a woman fires an AK-47 on a bright 
green lawn. The Inception model classifies all but the last two frames of this sequence 
correctly. When the woman moves the weapon from the firing position to below her hips
and turns towards the camera, it appears that the lawn in the background strongly influences
the classification scores. Table 5.1 shows that the classifier produces its strongest scores for 
two dog breeds and a lawn mower for the last two frames. Examining the ILSVRC 2012 
training set for the dog categories, we find that a significant portion of the images are of 
the dogs sitting on bright green lawns. The third highest classification score of lawn mower 
also supports this hypothesis. While the algorithm performed well enough on the ILSVRC 
2014 challenge for us to know that it is not simply relying on bright green backgrounds to 
detect these categories, it is interesting to note that the presence of grass does significantly 
influence the category scores.

(a) True positive: Lawn (+). (b) False negative: Lawn (-).

Figure 5.5. Stills from Jones’ [15] test set of an AK-47 video from YouTube.
See Table 5.1 for these stills’ top three category scores. Adapted from [60].


A final phenomenon that negatively affected AK-47 detection occurred in stills in
which a camera lost focus and the image became pixelated around the AK-47, perhaps due
to rapid motion of the weapon during firing. In the third example pair in Table 5.1, this
loss of focus led our algorithm to pick up the image’s more clearly defined objects, like the
boots on a man in the background of the image. These three examples cover a majority of
the false negatives produced by the Inception model and develop some intuitions for the
causes of failure.

Let us shift focus to false positives produced by Inception-ResNet-v2. There are fewer
on Jones’ [15] test set, but one trend exists, and the internet and rifle-like images produce
the most interesting trends for discussion. For Jones’ [15] test set, 5 of 10 false positives
contained military members in uniform without any weapons. Table 5.2 records the top
classification scores side-by-side with scores for the assault rifle category for these five
images. This trend is not surprising: many of the training images for the assault rifle
category contain military members in uniform. Thus, from the model’s perspective a
military uniform in a picture increases the likelihood that an assault rifle also exists in the
image. Since assault rifle earned the top category score for only one of these five images,
and the remaining four images’ scores are close to the best F-score threshold, this trend does
not produce significant concern in light of the Inception model classifying other images
with military members in uniform correctly. We could probably correct both this bias and
the green lawn bias by including more training examples to counter the biases; they do 
not represent a significant algorithm flaw. Moving on to the trends present on the internet 
and rifle-like image sets, 6 of 19 false positives on these sets came from images containing 
paintball guns. Considering the similar features between assault rifles and paintball guns, 
and the fact that the model does not possess a paintball gun category, this seems like a reasonable
mistake. Nine other false positives came from images of men holding large wrenches, power
drills, a pitchfork, and a fishing pole in carry positions common when holding a weapon,
like leaning on a shoulder, 45 degrees across the body, or underarm. Again, potential feature
similarities make these errors seem reasonable. Another false positive was for an image
containing the undercarriage of an automobile. The algorithm placed assault rifle as the 
eighth highest category for the image with a score of 0.022, well above the best F-score 
threshold for the rifle-like images set. Perhaps the edges and color patterns in the pipes 
and struts under the car resemble those features the algorithm learned from images of 
assault rifles. The remaining three false positives are harder to reason about and justify. 
They contain images of groups of people with no obvious patterns that could confuse the 
algorithm. Such mistakes, those that a human does not make or struggles to understand, 
are a weakness of algorithms incorporating ANNs which we will discuss more thoroughly 
in Chapter 6.

Table 5.2. Top category and assault rifle scores for AK-47 detector false
positives on military members in uniform.

Image   Top Category       Score   AR Rank   AR Score
1       mountain bike      0.197   5         0.042
2       military uniform   0.300   4         0.046
3       assault rifle      0.265   1         0.265
4       bulletproof vest   0.101   3         0.072
5       stretcher          0.188   4         0.055

This table provides the top category scores for five false positive images from
Jones’ [15] test set on images that contained military members in uniform. Any
score for the AR category above a threshold of 0.036 produced a false positive. The
first image was of a Marine in camouflage dancing next to two children on bicycles.
The remaining four images were from the same video sequences of army soldiers
running a relay race in camouflage in front of a building. The table includes the
rank, with one being the highest, that the assault rifle category score received out
of all 1,000 possible classifications. We include only the first category description
instead of the full name of each ILSVRC category.
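
The rank and threshold test described in the note to Table 5.2 can be sketched in a few lines of Python. This is a minimal illustration and not the code used in our experiments: the function name `ar_rank_and_decision` is hypothetical, and the score dictionary is a toy. Only the top category, the assault rifle score, and the 0.036 threshold come from the table (image 4); the two placeholder categories and their scores are invented so that the assault rifle category lands at rank three.

```python
def ar_rank_and_decision(scores, category="assault rifle", threshold=0.036):
    """Return (rank, flagged) for one image's category scores.

    rank is 1 for the highest-scoring category; flagged is True when the
    chosen category's score exceeds the detection threshold.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    rank = ordered.index(category) + 1
    return rank, scores[category] > threshold


# Toy scores patterned on image 4 of Table 5.2.
image_4 = {
    "bulletproof vest": 0.101,      # top category, from the table
    "placeholder category A": 0.090,  # hypothetical filler
    "assault rifle": 0.072,         # AR score, from the table
    "placeholder category B": 0.050,  # hypothetical filler
}

rank, flagged = ar_rank_and_decision(image_4)
print(rank, flagged)
```

The sketch shows why an image can become a false positive even when assault rifle is not the top category: the decision depends only on whether the AR score clears the threshold.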


5.2 Retrained Inception-v3 Ship Detector Performance 

This section includes our results from applying transfer learning techniques and retraining 
the Inception-v3 model as discussed in detail in Section 4.2.4. It is important to remember 
that the models discussed in this section benefit from previous training on the 1,000 categories
of images from ImageNet and that we only update the last fully connected layer. We
present performance of the first retrained model using only Camp’s [16] training images 
on his test set. Next, we provide the results for our second model iteration, which we 
retrained by adding the Ship Internet Images training set to Camp’s [16] training set, on 
both Camp’s [16] and the Ship Internet Images test sets, defined previously in Table 4.4. 
Finally, we explore missteps of the retrained Inception-v3 model by examining images it 
misclassified.

5.2.1 Ship Detector Performance on Camp’s Test Images

The retrained Inception-v3 model achieves near-perfect performance when classifying 
Camp’s [16] test images. Figure 5.6 depicts a comparison of Inception-v3 to Camp’s [16] 
best performing model at the four largest scales. As shown in this figure, the Inception algorithm
only misclassifies a single no ship image at the 15% scale. Moving to Figure 5.7, we
provide a closer view of the retrained Inception-v3 model’s performance at seven scales to
compare to Camp’s [16] best performing model across all scales. Observing Inception-v3’s
performance at the smallest scale, we see that it outperforms Camp’s [16] HYBRID
model at full scale. The Inception-v3 model achieves strong scale invariance as a ship
detector, a desirable trait for the UAS use case developed in Section 1.4. See Appendix A.2
for an ROC curve depicting all eight scales and a table of performance metrics at the best 
F-score threshold for each scale.

Figure 5.6. Inception-v3 ship detector performance when retrained on
Camp’s [16] training set compared to his DPM detector. Camp’s [16] DPM
detector achieved the highest performance of all his algorithms at scales
above 50%. Due to poor performance at smaller scales Camp [16] only
presents down to 25% scale in this plot. Source(b): [16, Figure 18].

(a) Inception-v3 ROC curve close-up.

Figure 5.7. Inception-v3 ship detector performance when retrained on
Camp’s [16] training set compared to HYBRID detector. Camp’s [16] HYBRID
detector achieved the best performance across all image scales. We
provide a close-up view of the ROC depicted in Figure 5.6 including smaller
scales to match Camp’s [16] plot. Source(b): [16, Figure 18].

Because he separates the ROC curves by scale, Camp [16] does not test his algorithms’
ability to use a single threshold and classify images across multiple scales. This capability 
would add utility to the ship detector we describe in our UAS scenario from Section 1.4. 
Thus, in Figure 5.8 we combine images from full scale down to 25% scale in order to measure 
the retrained Inception model’s classification performance across the scales. Examining 
the Camp Training Only curve, we see that Inception-v3 achieves scale invariance and 
still retains near perfect performance when performing classification using the single best 
threshold of 0.450 for multiple scales. The algorithm generates a single false positive 
and false negative across four different scales of each of the 405 positive examples and 99 
negative examples. Still, identifying a small-scale ship when it is the focus of the image 
remains an easier problem than having to identify the ship in an image covering a long 
horizon; this harder problem is the focus of our next tests. 
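
To make the single-threshold idea concrete, the following sketch pools scores from several scales and scans for the threshold that maximizes F-score. It is an illustration under stated assumptions, not our evaluation code: the function name `best_f_score_threshold` and every score in `scores_by_scale` are hypothetical, chosen only to show the mechanics.

```python
def best_f_score_threshold(scores, labels):
    """Return (threshold, f_score) maximizing F-score over candidate thresholds.

    An image is classified as positive when its score is >= threshold.
    Candidate thresholds are the distinct observed scores.
    """
    best_threshold, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f > best_f:
            best_threshold, best_f = t, f
    return best_threshold, best_f


# Hypothetical (ship score, label) pairs at each scale; label 1 means ship.
scores_by_scale = {
    "full": [(0.98, 1), (0.10, 0)],
    "75%": [(0.95, 1), (0.20, 0)],
    "50%": [(0.90, 1), (0.30, 0)],
    "25%": [(0.60, 1), (0.52, 0)],
}

# Pool every scale into one list so a single threshold must serve all of them.
pooled = [pair for pairs in scores_by_scale.values() for pair in pairs]
scores = [s for s, _ in pooled]
labels = [y for _, y in pooled]
threshold, f_score = best_f_score_threshold(scores, labels)
print(threshold, f_score)
```

Pooling before thresholding is the key design choice: a threshold tuned per scale can look strong on each ROC curve yet fail when one cutoff must handle every scale at once.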

5.2.2 Ship Detector Performance on Internet Images 

This section describes the performance of the Added Images model produced by a second 
iteration of retraining that incorporated training and test images from the more challenging 
Ship Internet Images set. We evaluate the Added Images model (described in Section 4.2.4)
that achieves the highest validation set accuracy during training while only completing 
1,000 training steps. This model had additional no ship and ship examples available as 
described in Section 4.2.1. Figure 5.8 displays a comparison of the performance of the first 
iteration of Inception-v3, the Camp Training Only model, to our Added Images model on 
Camp’s [16] test set across four scales. The Added Images model performs worse but not 
significantly; it still achieves an F-score of 0.99. The Added Images model does generate 
20 false positives out of 396 negative examples; however, 14 of these errors are repeats of 
the same images at different scales. Moreover, it only generates two false negatives across 
all four scales of ship examples. 

Confident that our Added Images model still performs well on Camp’s [16] images, we
provide a depiction of the algorithm’s stronger resistance to false positives on the Ship 
Internet Images test set in Figure 5.9. While at some thresholds the Camp Training Only 
model performs slightly better than a random guess, its best F-score still comes from 
classifying all 120 difficult test images as ships. Comparing the algorithms at their best 
F-score thresholds, the Added Images model increases its F-score over the Camp Training
Only model by 15 percentage points. At its best F-score threshold, the Added Images model
manages to classify 53 of 60 ship examples correctly while generating 24 false positives
out of 60 negative examples. The Added Images model does not perform well enough
to implement as is; yet, our results support that with more relevant training examples
Inception-v3 can learn to identify ships on the horizon and distinguish large ocean debris
from ships.

Figure 5.8. Inception-v3 iterations compared on Camp’s [16] test set from
full to 25% scale combined. This figure shows performance of the first
iteration Inception-v3 model, retrained only on Camp’s [16] images, and the
second iteration, retrained with the added images discussed in Section 4.2.1.
These curves show performance of the models across Camp’s [16] test images
combining the following scales: full, 75%, 50%, and 25%. We discuss the
second iteration model performance in Section 5.2.2.

Figure 5.9. Inception-v3 iterations compared on the Ship Internet Images
test set. (ROC curves; horizontal axis: False Positive Rate.)


5.2.3 Ship Detector Missteps Analysis 

This section discusses the errors for both iterations of the ship detector on Camp’s [16] 
test set and the Ship Internet Images set. We start with the first Inception-v3 iteration (the 
Camp Training Only model) evaluated across four scales. At its best F-score threshold the 
model misses two images at scales above 20%. It classifies a full-scale image of a large ice 
shelf on a hazy horizon as a ship. It is easy to imagine the algorithm identifying similar 
features to that of a ship’s bow at the pointed edges of the ice, driving the ship category 
score to a value of 0.516 which fell above the 0.450 threshold. Looking at the single false 
negative, the Camp Training Only model classifies a 75% scale image of a cargo ship as 
not containing a ship. The cargo ship in this image sits in front of a coastline with a water
tower in the distance. Due to Camp’s [16] training set containing several negative examples
with coastal structures, the algorithm may have been overly influenced by the presence of 
the water tower. This ship image scored 0.446 for the ship category, only 0.004 below the 
threshold to classify it as a ship. We omit a discussion of algorithm misses below 25% scale 
as the misclassifications do not contain obvious trends. 

Examining the Added Images model misses on the Ship Internet Images test set, we find 
two trends worth discussing. In terms of false positives, even with the additional land- 
based shipping containers in the training set, the algorithm still classifies eight images with 
shipping containers floating in the water as ships. It also classifies five box-like pieces of 
trash as ships; Figure 5.10 depicts two of these images. The distinct linear features of these
images are similar to those of many of the training images for ships. All seven of the false
negatives came from challenging images of ships distant on the horizon; again, Figure 5.10
depicts two of these example images. Since we know the algorithm can detect ships at
smaller scales, there are two options to improve performance on such examples: either
modify the algorithm to slide a window along the horizon, feeding less of the entire view
through the algorithm at a time, or simply provide more positive training examples with
ships at this distance and in different positions in the frame. These two trends suggest a
need for more training examples; we did not spend the time to procure them as building the 
additional internet images set already went beyond the original scope of this thesis. Still, 
this discussion should serve as a starting point for the next steps in developing a training set 
for another iteration of this algorithm in pursuit of refining it for operational use.

(c) False positive. Source: [63]. (d) False negative. Source: [64].

Figure 5.10. Example error images from the Ship Internet Images test set.
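
The sliding-window option discussed above for distant ships can be sketched as a function that generates square crop boxes along the horizon. We did not implement this in our experiments, so everything here is an assumption for illustration: the function name `horizon_windows`, the idea that the horizon row is already known, and the example image dimensions. The 299-pixel window matches the standard Inception-v3 input size.

```python
def horizon_windows(image_width, image_height, horizon_y, window_size, stride):
    """Return (left, top, right, bottom) crop boxes centered on the horizon row.

    Assumes window_size does not exceed either image dimension.
    """
    # Center the window vertically on the horizon, clamped to the image.
    top = max(0, min(horizon_y - window_size // 2, image_height - window_size))
    boxes = []
    left = 0
    while left + window_size <= image_width:
        boxes.append((left, top, left + window_size, top + window_size))
        left += stride
    # Add a final window flush with the right edge so no columns are skipped.
    if boxes and boxes[-1][2] < image_width:
        boxes.append((image_width - window_size, top, image_width, top + window_size))
    return boxes


# Hypothetical 1000x600 frame with the horizon at row 300.
boxes = horizon_windows(1000, 600, horizon_y=300, window_size=299, stride=150)
for box in boxes:
    print(box)
```

Each box would be cropped and classified separately, so a distant ship fills a larger share of the classifier's input than it does in the full frame, at the cost of several classifications per image.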


5.3 Screenshot Detector Performance for CNNs Trained 
from Scratch 

This section presents the performance of Oxford’s VGGNet, Google’s Inception-ResNet-v2,
and Microsoft’s ResNet-v2 (50 layer version) as screenshot detectors. Our best model
achieves the same high level of performance as Sharpe’s [17] best combination of all four
feature sets. Figure 5.11 depicts the ROC curves for both iterations of the three models.
All of the trained models show promise as screenshot detectors, but the Inception-ResNet-v2
model, trained for 50,000 steps, stands out as the top performer. At its best F-score
threshold, this model achieves the highest recall by three percentage points over the next
closest model by correctly identifying 2,314 of 2,420 screenshot examples. At the same 
time it produces the fewest false positives, classifying only 11 of 3,644 other images as 
screenshots. Sharpe [17] did not provide an ROC curve for her best performing combination 
of feature sets, but we reproduce the numbers from Table 2.4 to compare her algorithm that 
achieves the highest F-score to our best in Table 5.3. From this table we see that the 
algorithms obtain near-identical performance; Sharpe’s [17] best algorithm surpasses the 
Inception-ResNet-v2 algorithm when including the third significant digit. As discussed in 
Section 4.3.3, our algorithm evaluated all 6,064 of Sharpe’s [17] archived images while her 
numbers are averages from ten-fold cross validation across a selection of only 4,800 of the 
images. Also, as mentioned in Section 2.2.3, Sharpe’s [17] negative test images are grayscale.
This test set trait did not provide an advantage to her algorithms as they did not extract 
features based on color. In contrast, the Inception-ResNet-v2 model is able to detect color 
patterns in images; however, since we train the Inception model using only color images for 
both negative and positive training examples, we do not gain advantage from this test set 
trait either. Rather, we hinder the Inception model as it does not benefit from any useful 
color patterns it learned during training and simply has to rely on other features, such as 
higher level combinations of edges and shapes, to classify Sharpe’s [17] negative examples. 


Table 5.3. Screenshot detector performance comparison for best F-score
models. Adapted from [17, Table 4.9].

Algorithm                   Accuracy   Precision   Recall   F-score
Sharpe’s All Features       0.980      0.997       0.963    0.980
Inception-ResNet-v2 (50K)   0.981      0.995       0.956    0.975
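
The Inception-ResNet-v2 row of Table 5.3 can be recomputed from the counts reported in the text: 2,314 of 2,420 screenshots correctly identified and 11 of 3,644 other images classified as screenshots. The sketch below applies the standard definitions of accuracy, precision, recall, and F-score; the function name `metrics_from_counts` is ours for illustration.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f_score) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score


# Counts for the best Inception-ResNet-v2 model at its best F-score threshold.
tp = 2314          # screenshots correctly identified
fn = 2420 - tp     # screenshots missed
fp = 11            # other images classified as screenshots
tn = 3644 - fp     # other images correctly rejected

rounded = tuple(round(m, 3) for m in metrics_from_counts(tp, fn, fp, tn))
print(rounded)
```

Rounded to three decimal places, the four values agree with the Inception-ResNet-v2 row of the table.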


5.3.1 Screenshot Detector Missteps Analysis 

Once more looking at the algorithm misses, this time for the Inception-ResNet-v2 model 
at its best F-score threshold, we find three observations worthy of discussion in the false 
negatives and a single trend in the false positives. Figure 5.12 displays the images discussed 
in this section and Table 5.4 provides the images’ category scores produced by the Inception 
model. The first observation comes from the only false negative in a series of screenshots
from a Linux terminal. Comparing the images in Figure 5.12a and Figure 5.12b, the sole
difference is the presence of more text in the latter image. This extra text increased the
algorithm’s confidence that the second image contained a screenshot by 0.126, enough
to break the threshold and allow a correct classification. This observation supports the
hypothesis that the model learned to associate typed text as a feature of screenshots.

Figure 5.11. VGGNet, Inception-ResNet-v2, and ResNet models compared
on Sharpe’s [17] entire image set. (ROC curves; horizontal axis: False Positive Rate.)

(a) False negative. Source: [65]. (b) True positive. Source: [66].

(c) False negative. Source: [67]. (d) True positive. Source: [68].

Figure 5.12. Example false negatives (left) and related true positives (right)
from the Screenshots [17] test set.

Table 5.4. Category scores for images in Figure 5.12.

Image           Screenshot   Other
Terminal(-)     0.828        0.172
Terminal(+)     0.954        0.046
Real Image(-)   0.340        0.660
Real Image(+)   0.999        1.41E-12

This table provides the category scores from the best Inception-ResNet-v2 model
for the images in Figure 5.12. We classify all images using a best threshold of 0.897.
The examples are from similar images in the Screenshots test set to facilitate a
discussion of what features the model relies on for classification.


A second trend comes from a pattern of misses for screenshots that include large depictions 
of real images. The best model missed 15 such images. Figure 5.12c depicts one of 
these false negatives and Figure 5.12d depicts a similar screenshot, with an even larger 
area containing real images, that the algorithm classified correctly with almost perfect 
confidence. It is interesting to note the uniform and perfectly white margins separating the 
real images in the latter image. Such features rarely occur in our other category examples 
whereas the earth-toned colors of the less-defined margins of the false negative occurred in a 
portion of the real images. It appears that the presence of the more obvious artificial margins 
allows the algorithm to detect that the real images in Figure 5.12d are a part of a screenshot. 
A final source of bias in the false negative examples is the tendency of the algorithm to 
classify screenshot examples consisting of mostly blue colors as other category images. The 
algorithm misclassified 21 blue-hued screenshots. Looking to the other category training 
set, the source of this bias becomes obvious. As discussed previously, we train the algorithm 
with categories from ImageNet including buildings, highways, coasts and more. A majority 
of these types of images contain large swaths of blue sky. In contrast, our training set 
for the screenshot category did not contain significant numbers of blue-hued screenshots; 
about five percent were mostly blue. For this reason, it appears the algorithm became 
overly responsive to the prevalence of blue pixels as a predictor for the other category. 
One potential fix for this bias is to follow Sharpe’s [17] lead and forgo considering color 
features. We could do this by converting all training and test images to grayscale during
image preprocessing for the model; however, we did not pursue correcting this bias and
suspect limiting the model in this manner may have other undesired performance effects. 
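
As a rough illustration of the grayscale preprocessing option, the sketch below converts RGB pixels to a single luminance value. We did not implement this step, so the details are assumptions: the function name `to_grayscale` is hypothetical, and the weights are the common ITU-R BT.601 luma coefficients rather than anything taken from Sharpe’s [17] pipeline.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples (0-255) to rows of luminance values.

    Uses the ITU-R BT.601 luma weights, an assumed choice for this sketch.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


# A one-row toy image: a pure blue pixel next to a white pixel.
image = [[(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(image)
print(gray)
```

After conversion, a blue sky and a blue screenshot background both reduce to intensity alone, which removes the hue cue behind the bias described above but also discards any color patterns that help classification.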

Of the 11 false positives, ten contained furniture, houses, buildings or man-made structures. 
The high prevalence of linear features, large patches of consistent pixel intensity, and 
manufactured patterns in these images proved challenging for the algorithm as these features
prevail in screenshots. Even with these misses, the algorithm proved highly resistant to false
positives in other images where the same features exist, but are somewhat less prevalent. 

5.4 TensorFlow Models for DOD Operating Environments 

Beyond classification performance, our experiments assess feasibility in terms of runtime 
for implementing TensorFlow-based object detectors in simulated DOD use cases. We first 
compare the training and evaluation time of our algorithms to the times from the three 
previous theses captured in Chapter 2. Then we apply our image classification runtimes to 
three potential DOD applications to develop an intuition for the prospective time savings 
available through our TensorFlow-based object detectors compared to manual analysis. 

In general, if run on the same hardware, the deep CNNs we implement in TensorFlow take 
longer to train and evaluate than preselected feature extractors of previous models. For 
example, we did not attempt to train the deep CNNs from scratch using only CPUs due 
to the expected week-long-training times. Still, the effort savings afforded to algorithm 
designers, who no longer have to manually adjust methods for feature extraction for every 
new object detector, make the extra computational costs acceptable. Also, Table 4.6 shows
that methods like transfer learning can negate the extra training time required, even in
resource-constrained environments. When a classification problem warrants training a
deep CNN from scratch, Table 4.9 illustrates that a modern GPU and CPU combination allows
our algorithms to complete training in 12 hours or less, a reasonable training cycle for
testing many iterations of the algorithms. This is important as the algorithms often require
hundreds of iterations of training in order to conduct a proper hyperparameter search to
achieve optimal performance. With the largest computational costs levied during training,
Table 4.3 shows that the slowest TensorFlow-based detectors we tested still classify images
in less than one second in resource-constrained environments. Even though this evaluation 
runtime proves slower than previous methods, we believe the algorithms are fast enough to
remain relevant considering their improved accuracy and potential time savings over human-only
analysis of images. Importantly, when leveraging current hardware and optimized data
pipelines to feed images into the algorithms, deep CNNs prove capable of offering near-real-time
evaluation of images. By achieving classification speeds of over 60 frames per
second in such cases, deep CNNs show potential for analysis of live video feeds.

In order to provide a concrete example of how TensorFlow-based algorithms may perform in
real-world applications, Table 5.5 adapts the training and classification times from Chapter 4
to predict the TensorFlow-based algorithms’ runtimes in realistic scenarios. These time
estimates provide a rough measure of speed of classification of images in three potential
DOD applications. From our experiments we believe the DOD can benefit from TensorFlow
and deep CNNs in deployed, unmanned, and enterprise level systems and environments.


Table 5.5. Adaption of object detector runtimes in potential DOD environments.

Model                   Hardware                            Batch Size   Images     Time
Deployed
Inception-v3            Computer (CPU)                      1            5,000      31m
Inception-ResNet-v2     Computer (CPU & GPU)                100          10,000     3m
Unmanned
Inception-v3            UAS (CPU)                           1            3 stills   1.25s
Inception-v3            UAS (CPU & GPU)                     1            6 stills   0.25s
Enterprise
Inception-ResNet-v2     Data Center with 100x (CPU & GPU)   100          350M       14h
ResNet-v2 (50 Layers)   Data Center with 100x (CPU & GPU)   100          350M       7h

We derive these time estimates by dividing the number of images by the images
per second values in Table 4.3 for rows listed with a batch size of one and
Table 4.10 for rows with a batch size of 100. This table is meant to provide a
simple estimate to guide the reader to the right order of magnitude for the time
savings offered by employing the listed algorithms through TensorFlow on specific
hardware combinations. The UAS example considers a system with three cameras
or six cameras, providing 180 or 360 degree sensing respectively.
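
The arithmetic behind these estimates is a single division, sketched below. The function names are ours, and the throughput of 65 images per second per instance is a placeholder chosen only to be consistent with the "over 60 frames per second" figure mentioned earlier; it is not a measured value from Table 4.10. The image count and the 100 parallel instances follow the enterprise scenario described in the text.

```python
import math


def estimate_seconds(num_images, images_per_second, instances=1):
    """Estimated wall-clock seconds to classify num_images.

    Assumes the work divides evenly across identical parallel instances.
    """
    return num_images / (images_per_second * instances)


def round_up_hours(seconds):
    """Round a duration in seconds up to the nearest whole hour."""
    return math.ceil(seconds / 3600)


# Enterprise scenario with a placeholder per-instance throughput.
seconds = estimate_seconds(325_000_000, images_per_second=65.0, instances=100)
hours = round_up_hours(seconds)
print(seconds, hours)
```

Because the estimate scales inversely with throughput and instance count, doubling either one halves the projected time, which is the order-of-magnitude reasoning the table is meant to support.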


TensorFlow shows promise for application by deployed military units by easing time constraints
for intelligence analysts at the battalion and squadron level and above. Employing
only commonly available hardware, such as a laptop or desktop computer with a modern
CPU and at least 8GB of RAM, an analyst can now scan and tag images pulled from a confiscated
hard drive in minutes instead of hours or days. If the analyst has access to a device
containing a dedicated GPU, our measurements show the time required decreases roughly by
a factor of ten. Beyond the evaluation runtimes, referring back to the transfer learning times
in Table 4.6, deployed analysts could also repurpose previously trained Inception models to
identify images containing threat-specific objects, for example a local terror group’s
insignia. A typical laptop or desktop computer containing only a CPU can complete this
retraining process in under thirty minutes. The resulting algorithm will analyze images
with the same level of runtime efficiency and possibly, depending on the object and training
images available, the same high level of classification accuracy displayed by our retrained
ship detector. With hardware found in deployed environments today, TensorFlow-based
CNNs can benefit DOD users immediately.
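
The retraining idea, in which the previously trained layers stay fixed and only the final layer is updated, can be illustrated with a toy sketch. This is not the TensorFlow retraining procedure we used: it is plain Python gradient descent on a single sigmoid output, the function names are hypothetical, and the two-dimensional feature vectors are invented stand-ins for the outputs of the frozen layers.

```python
import math


def predict(weights, bias, features):
    """Sigmoid output of a single fully connected unit."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def retrain_last_layer(feature_vectors, labels, steps=1000, learning_rate=0.5):
    """Fit only the output layer on fixed features with batch gradient descent."""
    dim = len(feature_vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(feature_vectors)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feature_vectors, labels):
            err = predict(weights, bias, x) - y  # cross-entropy gradient at the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical frozen-layer outputs for the new object (1) and other images (0).
positives = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
negatives = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
weights, bias = retrain_last_layer(positives + negatives, [1, 1, 1, 0, 0, 0])
print(predict(weights, bias, [0.85, 0.15]), predict(weights, bias, [0.15, 0.85]))
```

Because only a small set of weights is trained, the cost is a small fraction of training the full network, which is consistent with retraining being feasible on a CPU-only laptop.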

TensorFlow-based algorithms’ runtimes on devices with modest resources suggest utility
for these algorithms in UAS. Recalling the ship-detecting UAS from Section 1.4, suppose
that system had a port, starboard and forward-facing camera capturing live video. A UAS
employing the Inception-v3 algorithm with a standard CPU could analyze stills taken from
these three sources, scanning a video frame from each sensor every 1.25 seconds. A version
of this same system with a GPU and three more cameras to allow 360-degree observation
could analyze a video frame from the six sensors four times a second. Once the algorithm
detects an object of interest, it will alert its operator and send a segment of the saved video, 
including frames before and after the detection, back for human review. There are further 
time savings available if the six images, one from each sensor, are sent through the algorithm 
simultaneously as a batch. With TensorFlow-based CNNs, in the near future the DOD could 
achieve real-time detection of objects of interest with UAS. 
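
One way to capture frames before and after a detection, as described above, is to keep a short rolling history of recent frames. The sketch below is a hypothetical design, not a component we built: the class name `DetectionClipper`, its parameters, and the use of integers as stand-in frames are all assumptions for illustration.

```python
from collections import deque


class DetectionClipper:
    """Collect frames surrounding each detection in a stream of frames."""

    def __init__(self, before, after):
        self.history = deque(maxlen=before)  # most recent frames seen
        self.after = after                   # frames to keep after a detection
        self.pending = None                  # clip currently being assembled
        self.remaining = 0                   # frames still owed to the clip

    def push(self, frame, detected):
        """Feed one frame; return a finished clip (list) or None."""
        finished = None
        if self.pending is not None:
            self.pending.append(frame)
            self.remaining -= 1
            if detected:
                self.remaining = self.after  # extend on repeated detections
            if self.remaining == 0:
                finished, self.pending = self.pending, None
        elif detected:
            self.pending = list(self.history) + [frame]
            self.remaining = self.after
            if self.remaining == 0:
                finished, self.pending = self.pending, None
        self.history.append(frame)
        return finished


# Ten stand-in frames with a single detection at frame 4.
clipper = DetectionClipper(before=2, after=2)
clips = []
for frame in range(10):
    clip = clipper.push(frame, detected=(frame == 4))
    if clip is not None:
        clips.append(clip)
print(clips)
```

Only the finished clips would be sent back for human review, so the link to the operator carries short segments rather than the full video feed.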

A final area for discussion of time efficiencies is in employing TensorFlow-based algorithms 
at an enterprise level. For enterprise in the DOD, we consider a data center and the agency 
that employed Clara from Chapter 1. In terms of the number of images, we start with 
an estimated 3.25 billion images [69] shared on popular social media platforms a day. 
Assuming the intelligence agency would narrow down these images to specific sources 
and only collect about ten percent of them daily, we develop our hypothetical 325 million 
images. If the agency converted these images to TFRecords during collection and the data 
center could support 100 instances of a TensorFlow-based algorithm, we divide by the
images per second values from Table 4.10 and round up to the nearest hour to produce the
estimate in Table 5.5. Although there are many more details to consider in optimizing the 
collection and storage process to facilitate efficient evaluation, it appears reasonable that our 
algorithms could flag images containing threat categories for human analyst review within 
hours of posting.

CHAPTER 6:
Conclusion 


With the three experiments complete and our results presented in Chapter 5, a discussion of 
the answers to our research questions and the salient outcomes from the results remains. To 
begin, in light of the empirical results, we will answer our first four research questions from 
Section 1.6 and discuss why our results are relevant to the DOD. Next, we consider traits 
beyond just raw performance measures and seek to answer our fifth research question as to 
the high-level risks and benefits of employing an open-source machine learning library in 
the DOD. Last, to answer our sixth and final research question, we draw relevant next steps 
for the DOD from our work. 


6.1 TensorFlow-Based Object Detectors’ Performance and 
Why It Matters 

In pursuit of our first four research questions we draw answers from the performance 
of the TensorFlow-based object detectors summarized in the ROC curves and the time
tables of Chapter 5. We start by comparing the training requirements of deep CNNs to 
previous methods in terms of compute resources and training data required, while also 
considering what our results reinforce about deep CNNs’ versatility across datasets. We 
follow this discussion with another focusing on the tested algorithms’ ability to scale for 
different operating environments in the DOD. Finally, we reflect on the improvement of 
classification performance and potential implications for future DOD systems incorporating 
these open-source solutions. 

It is well known that deep CNNs require more resources to train to convergence than legacy
methods in terms of compute power and labeled data, yet their versatility across object 
detection problems makes them relevant to the DOD. We show this fact once again by 
considering the over one million training images from the ImageNet [1] dataset required to 
train the Inception models and the over 20,000 images we collected to train our screenshot 
detectors. In contrast, Jones’ [15], Camp’s [16], and Sharpe’s [17] algorithms required only
6,806, 220, and 480 training images, respectively. Still, our results affirm that the costs
associated with developing larger training sets and executing extra computations during
training and evaluation prove worth paying. All three of the previous algorithm designers
first studied the subjects of their detectors closely to understand the relevant features present 
in the objects they sought to detect. Then, they selected their feature extractors with care 
to match their specific classification problem. Next, they trained and tested many iterations 
of these feature extractors to optimize performance. Finally, they designed a data pipeline 
to run their images through the algorithms and produce a prediction. Deep CNNs do not 
demand such costly human involvement for each new use case. These algorithms instead 
incur a one-time design cost. Later, the algorithms only require sufficient training examples 
of the new object for detection and compute resources to repurpose the algorithm. For this 
reason, we were able to complete our experiments in the same amount of time each of the 
three previous algorithm designers had to construct a single object detector. With industry
and academia paying these one-time design costs for their open-source deep CNNs and the
internet serving as a growing source of training examples, the DOD can now, more than ever,
afford to apply cutting-edge deep CNNs to meet a spectrum of its media analysis needs 
through open-source options. 

Organizations of the Department of Defense demand software solutions for media analysis
that function across a wide range of operating environments. Having discussed the steep
training costs associated with TensorFlow-based object detectors, one might mistakenly
assume that these algorithms only provide value to groups with significant compute
resources at their disposal. Our results show otherwise. By conducting costly training up
front, or repurposing trained networks through efficient methods like transfer learning, the
DOD can deploy trained deep CNNs in more austere environments for evaluating images.
By exploring the time required to classify images in both ideal and resource-constrained
configurations, we show that TensorFlow-based object detectors can operate on currently
available hardware across many DOD operating environments.

The final and most significant conclusion from our empirical results lies in the classification
performance improvements available through TensorFlow-based object detectors. Without
any investment in development time or compute resources for training, TensorFlow’s
latest Inception model offers increased performance over previous AK-47 detectors
and also shows promise in helping the DOD identify objects in images across hundreds
of other categories. Further, historical improvement across five iterations of the Inception
model suggests that Google will continue to update and release future versions incorporating
the latest advances in the field. Looking at repurposing an Inception model for a new task
with transfer learning, we found the model to achieve near perfect results on Camp’s [16]
test set, increasing recall rates by 10 to 20 percent across image scales while reducing false
positive rates. With this performance increase, we also show that deep CNNs remain
applicable to DOD media analysis problems even when limited numbers of training
examples exist. Finally, in retraining a deep CNN from scratch, we achieved nearly identical
results to Sharpe [17] on her test set. While this result is not a performance improvement,
her task lends itself to human-selected feature extractors. Features applicable for identifying
screenshots, like strong linear features, continuous patches of pixel values, and text, are
more obvious to a human eye than those required to distinguish more complex subjects in
natural photos. This trait of screenshots facilitated Sharpe’s [17] near perfect classification
performance with an F-score of 0.980. Coming in with a slightly lower F-score of 0.975,
the best deep CNN’s ability to learn applicable features without direct human input remains
equally impressive. Achieving a strong level of classification performance on this task, we
illustrate that when possessing sufficient images the DOD can apply proven algorithms to
new problems by training deep CNNs from scratch. In summary, by attaining F-scores
above 0.95 on all three previous test sets, we demonstrate the ability of open-source deep
learning solutions to assist members of the DOD across a spectrum of media analysis needs.

We believe these results matter to the Department of Defense. The three previous experiments
and ours span only seven years. In that time, machine learning algorithms have
matured from hand-crafted models for single uses to the deep CNNs that successfully
generalize across many unique image classification problems. These algorithms’ classification
error rates are low enough to start implementing them in systems to facilitate human-machine
teaming, where machines ease an analyst like Clara’s workload and increase her
effectiveness. The openness and collaboration of academia and industry continue to spur
progress more quickly than individual groups could achieve. Google’s Inception stands as
direct proof of this statement. Starting with the work of others, the company developed its
own state-of-the-art model, entering it into the ILSVRC 2014 competition. Then, inventing
and sharing new methods, Google released three further iterations of the model. Finally,
seeing an admirable idea from another industry leader, Microsoft, Google incorporated
residual layers into the Inception model and attained its most accurate model to date, which
it also made public. Referring back to Figure 5.3, one sees that in our application of
Inception as an AK-47 detector, this open collaboration reduced error rates by 50 percent on
Jones’ [15] test set between the first and last iteration. The DOD must find a way to include 
itself in this collaboration and bring to bear open-source progress in order to keep up with 
the current rate of change in the field of machine learning. 

6.2 Benefits and Risks of Employing TensorFlow in the 
DOD 

Having worked with multiple releases of TensorFlow for over a year, we summarize
our experience through an analysis of the high-level risks and benefits of applying
TensorFlow and deep CNNs in DOD work environments and defense systems. In doing
so, we also answer our fifth research question.

Beyond providing access to the powerful machine learning models discussed in Section 6.1,
the benefits of TensorFlow include abstractions through the API that facilitate collaboration
in solving machine learning tasks, continual updates and improvements to the software,
and scalability across compute platforms. Because TensorFlow provides low-to-high levels
of programming abstractions through its sub-libraries, like TensorFlow-Slim, users of the
software can quickly apply proven algorithms to new problems with the high-level
abstractions or develop new algorithms with the low-level abstractions. These abstractions
provide a framework for defining and discussing implementations of machine learning methods
to facilitate collaboration in solving complex problems. A second benefit of the software
is its pace of improvement. Where the DOD struggles to maintain its software libraries,
TensorFlow stands out for rapid releases of updates. These include new functionality,
iterations of algorithms with increased performance, TensorFlow implementations of the latest
machine learning models from academia, updates to data pipelines and processes in pursuit
of computational efficiencies, and more. Further benefits for the DOD are TensorFlow’s
cross-platform availability and the scalability of the software. In our tests we employed
TensorFlow on Ubuntu Linux, Windows 10, and CentOS operating systems and found our
scripts and experiments to work well across the platforms. Officially, TensorFlow supports
Ubuntu, Mac OS X, and Windows; but, as the source code is public, it is possible to
compile TensorFlow for many more systems. TensorFlow community members already
compile source code and share necessary files for installing TensorFlow on systems like
the Raspberry Pi. This leads us to the final benefit for discussion, scalability. With several
choices of open-source machine learning software frameworks, competition thrives among
maintainers to make the most efficient and versatile implementations of proven algorithms
and methods. The resulting efficiency and versatility of TensorFlow allow it to run on
platforms as small as the Raspberry Pi while also scaling to run in production environments,
like Google requires, through the TensorFlow Serving library. Also, Google offers cloud
services with TensorFlow [70] running on hardware specialized for machine learning, which
could benefit larger organizations within the DOD. Such scalability encourages application
of the software throughout all levels of the DOD. This limited discussion of the benefits
of TensorFlow, coupled with the performance improvements discussed previously, should
sufficiently motivate efforts to incorporate it in the DOD.

Adopting TensorFlow in the Department of Defense does involve some risks and challenges.
We introduce these here and discuss steps for overcoming them in Section 6.3. First, the
software library requires an understanding of computer programming that the average
candidate user for the software in the DOD does not currently possess. Even with access to
online tutorials and the prerequisite programming skill, understanding TensorFlow’s data
pipelines, queuing systems, and other processes requires significant study and effort. This
proved especially true when attempting to import and train proven models with a new
dataset. Second, where we discussed the library’s rapid updates as a benefit, a consequence
of such innovation is that the software provides multiple options to conduct the same task.
This leads to confusion as to which sub-library and which function calls to employ when
developing code. These multiple options do serve two purposes: they allow users to complete
tasks at different levels of abstraction, and they preserve functionality as legacy methods
and procedures get updated, marked for removal, and eventually discontinued. Still, multiple
options can become a source of confusion to users learning how to employ TensorFlow. A final
challenge the DOD will face in using TensorFlow stems from its permissive Apache 2.0 License.
The license allows anyone to modify the source code of TensorFlow and release it under a new
license. Acquisition professionals seeking to incorporate TensorFlow into a DOD program
will have to ensure that contractors do not make modifications to the software library,
copyright the slightly modified version, and sell it to the DOD as a proprietary software
library. These risks and challenges are not inconsequential; regardless, we believe the DOD
can accept and overcome them to reap the benefits available through open-source software
like TensorFlow.

Finally, responsibly applying deep CNNs in the DOD requires recognizing and appreciating
the errors such algorithms make. As we showed in an analysis of each object detector’s
missteps in Chapter 5, deep CNNs do not make the same types of mistakes humans make.
They are capable of missing objects obvious to a human analyst and can also generate
false positives in cases that would not deceive the human eye. Appreciating this fact
narrows the types of problems with which these algorithms can assist. Specifically, current
deep CNNs do well filtering and prioritizing information for human review, serving on a
human-machine team, rather than making independent decisions that bear significant
consequence, like automatic weapons release. A benefit of initial human-machine teaming
is the ability of a human to identify patterns in an algorithm’s mistakes, especially for
false positives, and add misclassified examples to a training database with the goal of
improving the algorithm’s performance. For example, with a deep CNN AK-47 detector, a
program would first sort images from a confiscated hard drive from the highest to lowest
score of containing a weapon. The human reviewer could quickly inspect the highest-ranking
images for actionable intelligence. When time permitted review of the remaining images,
the analyst could complete her inspection and add any identified false negatives to the
algorithm’s training set. In this way it is possible to improve the AK-47 detector in the
future without requiring a complete redesign of the model. In applying deep CNNs to
problems that facilitate such human-machine teaming, the DOD can mitigate risk and gain
an appreciation for their strengths and weaknesses while considering their potential
for future independent applications.
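
The triage workflow described above might look like the following sketch. The scores,
file names, threshold, and the review callback are hypothetical stand-ins for a detector's
output and an analyst's judgment; none of them come from the thesis code or experiments.

```python
def rank_for_review(scores):
    """Sort (image_name, score) pairs from highest to lowest weapon score."""
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def triage(scores, threshold, review):
    """Walk images in ranked order and collect the detector's mistakes.

    scores    -- iterable of (image_name, score) pairs from a detector
    threshold -- score at or above which the detector flags an image
    review    -- callable returning True when the analyst sees the object
    Returns (false_positives, false_negatives) as lists of image names.
    """
    false_positives, false_negatives = [], []
    for name, score in rank_for_review(scores):
        flagged = score >= threshold
        present = review(name)
        if flagged and not present:
            false_positives.append(name)
        elif present and not flagged:
            false_negatives.append(name)
    return false_positives, false_negatives


# Hypothetical detector scores for four images from a seized drive.
detector_scores = [("a.jpg", 0.91), ("b.jpg", 0.03), ("c.jpg", 0.40), ("d.jpg", 0.01)]
# Hypothetical analyst findings: which images truly show the weapon.
analyst_truth = {"a.jpg": True, "b.jpg": True, "c.jpg": False, "d.jpg": False}

fps, fns = triage(detector_scores, threshold=0.05, review=analyst_truth.get)
print("false positives:", fps)
print("false negatives:", fns)
```

The two returned lists are the misclassified examples an analyst would add to the
training database before the next round of retraining.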

6.3 Next Steps for the DOD to Employ Open-Source 
Machine Learning Solutions 

In an effort to address some of the risks of using TensorFlow, and to answer our final research
question, this section explores potential next steps for DOD organizations hoping to benefit
from open-source machine learning solutions. Our research confirms that TensorFlow offers
a needed capability to the Department of Defense, a first step in a more complex problem
of providing access to similar software for groups across the organization. Significant work
remains to implement the technology and software we explored in this thesis for operational
use cases. A potential next step is to compare available libraries and select one or more for
use on DOD systems and networks. Narrowing the selection to a few of the most promising
frameworks mitigates the risk of employing a sole framework, inspires further competition
among open-source communities, and still encourages collaboration as DOD groups can
share solutions they implement within the same open-source machine learning libraries. A
second step is to make it known that the selected software is available to DOD users. While
we showed that policy exists to allow open-source products in the DOD, the organization
must inform personnel of their ability to employ the software and provide specific guidance
about licensing considerations and terms of use. Further, information assurance policies
must permit installation of the software on government-owned systems; this may require
certification of the software by DOD security professionals.

Once given access to the software, personnel will require training. Thus, DOD organizations
must develop a training pipeline to provide the prerequisite knowledge of a programming
language, a basic understanding of the principles of machine learning, and introductory
training on application-specific algorithms, like deep CNNs, so that personnel gain an
appreciation for their strengths and the types of mistakes they make. Fortunately, the
required training topics are available through online courses and material [32], [71], [72],
and the DOD could train personnel efficiently by developing certifications for personnel
who complete a combination of online courses. These three steps can move the DOD closer to
delivering the ability to apply open-source object detectors in media analysis tasks to
intelligence analysts at many levels of the organization.

As a potential strategy to reduce training requirements, the DOD could also contract a
company to build and maintain a graphical user interface that makes calls to an open-source
machine learning library. This project would require minimal software development
but would allow everyday users access to deep learning algorithms, like the Inception
models, without requiring significant instruction. The interface could also allow users to
apply transfer learning to the Inception models by supplying directories containing example
images for new categories. Because the DOD would contract the company only to provide
the graphical user interface, the contractor could not make the underlying library
proprietary, and the cost of employing TensorFlow would remain low. With this step
complete, everyday members of the DOD could deploy powerful deep learning solutions to
new, organization-specific media analysis challenges.
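
One way such an interface might gather user-supplied examples is sketched below. It
assumes a hypothetical layout in which each subdirectory of a root folder is named for a
category and holds that category's images. The function name, the layout, and the list of
accepted file extensions are illustrative assumptions; the sketch only builds the labeled
file list and does not call any machine learning library.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of accepted formats


def collect_examples(root):
    """Map each category (subdirectory name) to its sorted image paths."""
    examples = {}
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        images = sorted(
            p for p in category_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
        if images:  # skip empty categories rather than train on nothing
            examples[category_dir.name] = images
    return examples


# Demonstration on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as root:
    layout = {"ship": ["s1.jpg", "s2.png"], "no_ship": ["n1.jpg", "notes.txt"]}
    for category, names in layout.items():
        folder = Path(root) / category
        folder.mkdir()
        for name in names:
            (folder / name).touch()
    found = collect_examples(root)
    summary = {label: [p.name for p in paths] for label, paths in found.items()}

print(summary)
```

A labeled list of this kind is the input a transfer learning routine would need,
whichever open-source library sits beneath the interface.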

Starting with the hypothesis that the Department of Defense should procure and
implement open-source algorithms and software libraries, we developed the research questions
necessary to explore this supposition. After selecting three previous defense-related theses
and datasets, we developed experiments to apply an open-source machine learning framework
to the same images to create a baseline for performance comparison. We captured
timing metrics for the training and evaluation of the deep learning algorithms and related
our measurements to realistic scenarios to show that open-source deep CNNs can function
across several DOD environments. We observed that TensorFlow-based object detectors
outperform or match previous algorithm performance, while requiring significantly less
effort to apply them to each new problem. These observations provide empirical evidence
supporting future DOD procurement and application of open-source machine learning
software. Beyond our results, we argue that the benefits of open-source deep learning
libraries outweigh the associated risks and that it is logical for the DOD to employ them.
In illustrating TensorFlow’s utility to the Department of Defense, we assert that the
organization can benefit from the open-source innovations of industry and academia in the
field of machine learning. Doing so affords the DOD an opportunity to increase efficiency
and to keep pace with state-of-the-art advances in order to maintain a technological edge
over its adversaries.
APPENDIX: Detailed Results


A.1 AK-47 Detector ROC Curves and Tables


A.1.1 Three Classification Scores Comparison

[ROC plot. Legend: v1 AR, max F-score 0.86 (FPR 0.58%, TPR 80.35%); v1 rifle, max
F-score 0.85 (FPR 0.70%, TPR 79.22%); v1 AR + rifle, max F-score 0.86 (FPR 0.52%,
TPR 80.10%).]

Figure A.1. Inception-v1 ROC curve for three category scores on all test
sets.



Table A.l. Inception-vl best F-score performance metrics all test sets. 


Model          Thres.  Acc.   Prec.  FPR    Rec.   F-score
v1 AR          0.047   0.976  0.938  0.006  0.804  0.866
v1 rifle       0.022   0.973  0.925  0.007  0.792  0.853
v1 AR + rifle  0.072   0.976  0.944  0.005  0.801  0.866
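
The "Max F-score" operating points reported with each ROC curve can be found by sweeping
a decision threshold over a detector's scores. The sketch below assumes the standard
definitions of precision, recall, false positive rate, and F-score. The scores and labels
are made-up toy data, not results from these experiments, and the function names are
illustrative.

```python
def metrics_at(scores, labels, threshold):
    """Compute metrics when images scoring >= threshold are flagged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"threshold": threshold, "precision": precision,
            "recall": recall, "fpr": fpr, "f_score": f_score}


def best_f_score(scores, labels):
    """Try every observed score as a threshold; return the best F-score row."""
    candidates = (metrics_at(scores, labels, t) for t in sorted(set(scores)))
    return max(candidates, key=lambda row: row["f_score"])


# Toy data: detector scores and whether each image truly contains the object.
toy_scores = [0.90, 0.80, 0.30, 0.20, 0.10, 0.05]
toy_labels = [True, True, False, True, False, False]
best = best_f_score(toy_scores, toy_labels)
print(best)
```

Each row of the tables in this appendix corresponds to one such best-threshold row for
a given model and test set.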
Figure A.2. Inception-v2 ROC curve for three category scores on all test
sets.


Table A.2. Inception-v2 best F-score performance metrics all test sets.

Model          Thres.  Acc.   Prec.  FPR    Rec.   F-score
v2 AR          0.030   0.984  0.948  0.005  0.889  0.917
v2 rifle       0.011   0.982  0.924  0.008  0.893  0.908
v2 AR + rifle  0.040   0.984  0.940  0.006  0.894  0.917


Figure A.3. Inception-v3 ROC curve for three category scores on all test
sets.


Table A.3. Inception-v3 best F-score performance metrics all test sets.

Model          Thres.  Acc.   Prec.  FPR    Rec.   F-score
v3 AR          0.058   0.983  0.946  0.005  0.878  0.911
v3 rifle       0.016   0.969  0.874  0.013  0.805  0.838
v3 AR + rifle  0.108   0.983  0.956  0.004  0.869  0.910
Figure A.4. Inception-v4 ROC curve for three category scores on all test
sets.


Table A.4. Inception-v4 best F-score performance metrics all test sets.

Model          Thres.  Acc.   Prec.  FPR    Rec.   F-score
v4 AR          0.063   0.988  0.982  0.002  0.890  0.934
v4 rifle       0.019   0.984  0.944  0.006  0.885  0.914
v4 AR + rifle  0.077   0.987  0.967  0.003  0.897  0.931




Figure A.5. Inception-ResNet-v2 ROC curve for three category scores on all
test sets.


Table A.5. Inception-ResNet-v2 best F-score performance metrics all test
sets.

Model                 Thres.  Acc.   Prec.  FPR    Rec.   F-score
ResNet-v2 AR          0.036   0.989  0.972  0.003  0.916  0.943
ResNet-v2 rifle       0.011   0.985  0.942  0.006  0.906  0.924
ResNet-v2 AR + rifle  0.037   0.989  0.957  0.005  0.926  0.941










A.1.2 Jones’ Test Set Performance Comparison 



Figure A.6. AK-47 detector performance using the AR category for all
Inception models on Jones’ [15] test set (zoom in).


Table A.6. Inception models’ best F-score performance metrics Jones test
sets.

Model      Thres.  Acc.   Prec.  FPR    Rec.   F-score
v1         0.047   0.983  0.947  0.005  0.857  0.900
v2         0.030   0.990  0.962  0.004  0.923  0.942
v3         0.125   0.989  0.987  0.001  0.889  0.936
v4         0.053   0.992  0.988  0.001  0.926  0.956
ResNet-v2  0.036   0.993  0.985  0.001  0.939  0.961
A.1.3 All Test Set Performance Comparison





Figure A.7. Best AK-47 detector performance using the AR category for 
internet test sets. 


Table A.7. Inception-ResNet-v2’s best F-score performance metrics comparing
Jones and internet test sets.

Image Set            Thresh.  Acc.   Prec.  FPR    Rec.   F-score
Jones Video Frames   0.036    0.993  0.985  0.001  0.939  0.961
Internet Images      0.002    0.956  0.980  0.014  0.916  0.947
Internet/Rifle-Like  0.004    0.911  0.833  0.078  0.888  0.860
A.2 Ship Detector ROC Curves and Tables

Figure A.8. Ship detector performance across all Camp’s test set scales
retrained with only Camp’s training images.


Table A.8. Ship detector best F-score performance metrics on Camp’s test 
set at all scales. 


Image Set   Thresh.  Acc.   Prec.  FPR    Rec.   F-score
Full Scale  0.517    1.000  1.000  0.000  1.000  1.000
75% Scale   0.406    0.998  0.998  0.010  1.000  0.999
50% Scale   0.362    1.000  1.000  0.000  1.000  1.000
25% Scale   0.426    1.000  1.000  0.000  1.000  1.000
20% Scale   0.410    0.996  0.998  0.010  0.998  0.998
15% Scale   0.327    0.992  0.995  0.020  0.995  0.995
10% Scale   0.090    0.976  0.980  0.082  0.990  0.985
5% Scale    0.067    0.952  0.966  0.143  0.975  0.970
A.3 Screenshot Detector ROC Curves and Tables





Figure A.9. Screenshot detector all models’ performance on all Sharpe’s [17] 
images. 


Table A.9. Screenshot detector best F-score performance metrics on 
Sharpe’s images all models. 


Model                      Thresh.  Acc.   Prec.  FPR    Rec.   F-score
VGGNet-16 (30K)            0.097    0.956  0.967  0.021  0.921  0.944
ResNet-v2-50 (30K)         0.074    0.958  0.970  0.019  0.922  0.946
Inception-ResNet-v2 (30K)  0.055    0.964  0.983  0.010  0.926  0.954
VGGNet-16 (50K)            0.721    0.963  0.986  0.009  0.919  0.952
ResNet-v2-50 (50K)         0.056    0.956  0.989  0.007  0.899  0.942
Inception-ResNet-v2 (50K)  0.897    0.981  0.995  0.003  0.956  0.975
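
As a consistency check, the F-score column of Table A.9 can be recomputed from its
precision and recall columns, assuming the standard harmonic-mean definition
F = 2PR / (P + R). Because the tabulated inputs are already rounded to three decimals,
agreement is only expected to within about 0.001. The function and variable names below
are illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (model, precision, recall, reported F-score) copied from Table A.9.
TABLE_A9 = [
    ("VGGNet-16 (30K)", 0.967, 0.921, 0.944),
    ("ResNet-v2-50 (30K)", 0.970, 0.922, 0.946),
    ("Inception-ResNet-v2 (30K)", 0.983, 0.926, 0.954),
    ("VGGNet-16 (50K)", 0.986, 0.919, 0.952),
    ("ResNet-v2-50 (50K)", 0.989, 0.899, 0.942),
    ("Inception-ResNet-v2 (50K)", 0.995, 0.956, 0.975),
]

for model, precision, recall, reported in TABLE_A9:
    recomputed = f_score(precision, recall)
    print(f"{model:26s} reported={reported:.3f} recomputed={recomputed:.4f}")
```

The same check applies to the other tables in this appendix, since they share the
precision, recall, and F-score columns.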